100% LLM accuracy–no fine-tuning, JSON only
Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.
Cultural AI benchmark demonstrating 100% accuracy
The repo ships a runnable eval_framework.py and a 20-question public sample (samples/sample_20q.jsonl) so you can reproduce the headline model comparisons locally. The claim — Triad Engine hits 100% vs Claude 4.6 at 0/45% — is eye-catching, but the full 222-question dataset and detailed methodology are gated behind an email request, which makes reproducibility and cherry-picking concerns the main barrier to taking the results seriously.
AI/NLP researchers, benchmarkers, prompt-engineers and developers building culturally grounded or multi-agent language systems
Benchmark proves cultural grounding: Triad 100% vs Claude 4.6 45% on 222q anachronism test.
Public: eval framework + 20 sample questions Gated: full research dataset (airtrek.ai/research)
Cultural intelligence that frontier models fail.
Feedback welcome!
Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.
Names compression-step hallucination, but it's a paper not a tool you can use.
Open-source AI docs benchmark where the authors' own tool conveniently scored highest.
Git-like branching for columnar data with DuckDB-beating benchmarks from pure JVM.
Fourteen parallel Claude agents grade your startup idea's evidence before you quit your job.
Interactive DuckDB-WASM benchmark beats static leaderboards for agentic SQL eval.