GitHub Repository

Benchmark measuring the historical accuracy of AI-generated images. 24 image pairs (3 characters × 8 scenes) set in Rome, 110 CE, comparing naive prompts with culturally-grounded prompts. Blinded A/B evaluation shows structured knowledge injection produces 5x more historically accurate images. Includes prompts, evaluation rubric, and reproducible pipeline.

2 stars · Python

AI image models hallucinate history, we built a method to fix it

Author: MysticBirdie

by MysticBirdie · Mar 9, 2026 · 1 point · 2 comments


AI Analysis


Naive prompts hallucinate history; structured knowledge injection raises accuracy from 12.5% to 83.3%.

Strengths

• Blinded evaluation methodology reduces judge bias; Gemini 2.0 Flash scores each pair without knowing which image came from which prompt.
• Fully reproducible pipeline with all 48 images, prompts, and evaluation code open source; easy to verify results.
• Precise insight for practitioners: image models drop unrecognized historical terms; visual translation outperforms terminology.

Weaknesses

• Limited scope: only 24 prompts across one historical domain (Rome 110 CE); generalization to other periods unclear.
• Triad Engine framework not included; only schema example provided, limiting immediate reproducibility of prompt grounding method.

Post Description

We created 24 image prompts across 3 characters living in Rome, 110 CE. Each prompt has a naive version and a culturally-grounded version enhanced by the Triad Engine (structured domain knowledge injection). Same model, same pipeline, only the prompt changes. A blinded Gemini Vision judge scores each pair without knowing which is which.
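The blinded comparison described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `Pair` class, the `evaluate_blinded` function, and the judge's `"A"` / `"B"` / `"tie"` return convention are all assumptions made for the example. The judge is passed in as a plain callable, so a real vision-model call (or a stub, as in the demo) can be plugged in.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

@dataclass
class Pair:
    """One benchmark pair (hypothetical structure, not the repo's schema)."""
    pair_id: str
    raw_image: str       # image generated from the naive prompt
    grounded_image: str  # image generated from the grounded prompt

# Assumed judge interface: receives the images shown as "A" and "B",
# returns "A", "B", or "tie" for which one is more historically accurate.
Judge = Callable[[str, str], str]

def evaluate_blinded(pairs: Iterable[Pair], judge: Judge, seed: int = 0) -> Dict[str, int]:
    """Show each pair in random order, then map the verdict back to its condition."""
    rng = random.Random(seed)
    wins = {"raw": 0, "grounded": 0, "tie": 0}
    for pair in pairs:
        # Randomize which condition appears as "A" so position carries no signal.
        grounded_is_a = rng.random() < 0.5
        if grounded_is_a:
            image_a, image_b = pair.grounded_image, pair.raw_image
        else:
            image_a, image_b = pair.raw_image, pair.grounded_image

        verdict = judge(image_a, image_b)
        if verdict not in ("A", "B", "tie"):
            raise ValueError(f"unexpected verdict {verdict!r} for {pair.pair_id}")

        # Unblind: translate the positional verdict into a condition label.
        if verdict == "tie":
            wins["tie"] += 1
        elif (verdict == "A") == grounded_is_a:
            wins["grounded"] += 1
        else:
            wins["raw"] += 1
    return wins

if __name__ == "__main__":
    # Stub judge for demonstration only; a real judge would inspect the images.
    demo_pairs = [Pair(f"p{i}", f"raw_{i}.png", f"grounded_{i}.png") for i in range(3)]
    print(evaluate_blinded(demo_pairs, lambda a, b: "A" if "grounded" in a else "B"))
```

Keeping the A/B assignment inside the evaluation loop means the judge never sees a condition label, and seeding the random generator makes the presentation order repeatable between runs.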

Results:

• RAW (naive prompt): 12.5% historically accurate
• TRIAD (grounded prompt): 83.3% historically accurate
• In 23 of 24 pairs, the grounded image was judged more accurate
• In 0 of 24 pairs was the naive image judged better

The key insight for prompt engineers: image models silently drop historical terms they don't recognize. "dextrarum iunctio handshake" produces nothing useful. "two men clasping right hands wrist-to-wrist, elbows raised" works. Visual translation, not historical terminology.
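One simple way to apply the visual-translation idea is a glossary lookup that swaps historical terms for plain visual descriptions before the prompt reaches the image model. The sketch below is only an illustration and is not the Triad Engine, which the post does not show: `VISUAL_GLOSSARY` and `translate_terms` are hypothetical names, and the single glossary entry is the example pair quoted in the post.

```python
import re
from typing import Dict

# Hypothetical glossary: historical term -> description of what it looks like.
# The one entry here is the example given in the post.
VISUAL_GLOSSARY: Dict[str, str] = {
    "dextrarum iunctio handshake":
        "two men clasping right hands wrist-to-wrist, elbows raised",
}

def translate_terms(prompt: str, glossary: Dict[str, str] = VISUAL_GLOSSARY) -> str:
    """Replace each glossary term in the prompt with its visual description."""
    # Longest terms first, so a short term never clobbers part of a longer one.
    for term in sorted(glossary, key=len, reverse=True):
        description = glossary[term]
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        # A function replacement keeps the description literal (no backslash escapes).
        prompt = pattern.sub(lambda _match: description, prompt)
    return prompt

if __name__ == "__main__":
    print(translate_terms("A merchant and a senator, dextrarum iunctio handshake, Roman Forum"))
```

A plain string lookup like this only covers terms someone has already written a description for; it shows the substitution step, not how the descriptions are authored.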

The full benchmark — all 48 images, prompts, evaluation data, and reproducible pipeline — is open source. You can re-run the blinded evaluation yourself with a free Gemini API key.

Repo: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...

Paper: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...