GitHub Repository

Cultural AI benchmark demonstrating 100% accuracy

1 star·Python

Triad Engine beats Claude 4.6 (100% vs. 45%) on Rome cultural benchmark

by MysticBirdie·Feb 15, 2026·1 point·2 comments

AI Analysis

Mid·Niche Gem·Bold Bet
The Take

The repo ships a runnable eval_framework.py and a 20-question public sample (samples/sample_20q.jsonl), so you can rerun the model comparison locally on that sample. The headline claim, Triad Engine at 100% vs Claude 4.6 at 0/45%, is eye-catching, but the full 222-question dataset and the detailed methodology are gated behind an email request. That gating leaves reproducibility and cherry-picking concerns as the main barrier to taking the results seriously.
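To put the sample-size concern in numbers, here is a minimal sketch of a Wilson score interval for an accuracy measured on a small question set. It is not part of the repo: the function name `wilson_interval` and the 20-out-of-20 input are assumptions chosen only to mirror the size of the public sample.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence).

    Hypothetical helper for illustration; it is not taken from eval_framework.py.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed input: a perfect score on a 20-question sample.
low, high = wilson_interval(20, 20)
print(f"20/20 correct: interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions the lower bound comes out at roughly 0.84, so a perfect score on 20 questions is compatible with a true accuracy well below 100%. The public sample alone cannot confirm the 222-question result.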

Target Audience

AI/NLP researchers, benchmarkers, prompt engineers, and developers building culturally grounded or multi-agent language systems

Post Description

Live MVP at airtrek.ai - Ancient Rome therapist with 3 voiced characters (senator, noblewoman, merchant).

Benchmark proves cultural grounding: Triad 100% vs Claude 4.6 45% on 222q anachronism test.

Public: eval framework + 20 sample questions
Gated: full research dataset (airtrek.ai/research)

Cultural intelligence that frontier models fail.

Feedback welcome!
