GitHub Repository

Cultural AI benchmark demonstrating 100% accuracy

1 star · Python

Triad Engine beats Claude 4.6 (100% vs. 45%) on Rome cultural benchmark

by MysticBirdie · Feb 15, 2026 · 1 point · 2 comments

AI Analysis

● Mid · Niche Gem · Bold Bet

The Take

The repo ships a runnable eval_framework.py and a 20-question public sample (samples/sample_20q.jsonl), so you can rerun the model comparison locally on that sample. The headline claim (Triad Engine at 100% vs. Claude 4.6 at 45%) is eye-catching, but the full 222-question dataset and detailed methodology are gated behind an email request. That leaves reproducibility and possible cherry-picking as the main barriers to taking the results seriously.
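To put the size of the public sample in perspective, here is a minimal sketch of a Wilson score interval for an accuracy estimate. It is not part of the repo and does not use eval_framework.py; the function name wilson_interval is hypothetical, and the only input taken from the listing is the 20-question sample size.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative input: a perfect score on a 20-question sample.
low, high = wilson_interval(20, 20)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Even a perfect 20 of 20 leaves a lower bound of roughly 0.84, so the public sample alone cannot confirm a 100% score on the gated 222-question set.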

Post Description

Live MVP at airtrek.ai - Ancient Rome therapist with 3 voiced characters (senator, noblewoman, merchant).

Benchmark proves cultural grounding: Triad 100% vs. Claude 4.6 45% on a 222-question anachronism test.

Public: eval framework + 20 sample questions
Gated: full research dataset (airtrek.ai/research)

Cultural intelligence that frontier models fail at.

Feedback welcome!

Similar Projects

AI/ML ● Mid

100% LLM accuracy–no fine-tuning, JSON only

Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.

Big Brain

MysticBirdie

223mo ago

AI/ML ● Mid

Do Thought Streams Matter? A Benchmark of VLM Reasoning in Gemini 2.5

Names compression-step hallucination, but it's a paper not a tool you can use.

Big Brain · Niche Gem

ashu_trv

301mo ago

Developer Tools ●● Solid

We beat Google, Cognition, Claude Code at codebase docs generation

Open-source AI docs benchmark where the authors' own tool conveniently scored highest.

Big Brain

curious_nile

231mo ago

Data ●●● Banger

Stratum – SQL that branches and beats DuckDB on 35/46 1T benchmarks

Git-like branching for columnar data with DuckDB-beating benchmarks from pure JVM.

Big Brain · Wizardry

whilo

1233mo ago

AI/ML ●● Solid

TweakIdea – 14-dimension startup idea evaluation in Claude Code

Fourteen parallel Claude agents grade your startup idea's evidence before you quit your job.

Big Brain · Niche Gem

ephx

101mo ago

AI/ML ●● Solid

An Interactive Text to SQL Agent Benchmark

Interactive DuckDB-WASM benchmark beats static leaderboards for agentic SQL eval.

Big Brain · Niche Gem

102mo ago