GitHub Repository

Benchmark measuring the historical accuracy of AI-generated images. 24 image pairs (3 characters × 8 scenes) set in Rome, 110 CE, comparing naive prompts with culturally grounded prompts. Blinded A/B evaluation shows structured knowledge injection produces 5x more historically accurate images. Includes prompts, evaluation rubric, and a reproducible pipeline.

2 stars · Python

AI image models hallucinate history; we built a method to fix it

by MysticBirdie·Mar 9, 2026·1 point·2 comments

AI Analysis

●●● Banger · Big Brain · Wizardry · Solve My Problem

Naive prompts hallucinate history; structured knowledge injection raises accuracy from 12.5% to 83.3%.

Strengths
  • Blinded evaluation methodology reduces judge bias; Gemini 2.0 Flash scores each pair without knowing which image came from which prompt.
  • Fully reproducible pipeline with all 48 images, prompts, and evaluation code open source; easy to verify results.
  • Precise insight for practitioners: image models drop unrecognized historical terms; visual translation outperforms terminology.
Weaknesses
  • Limited scope: only 24 prompts across one historical domain (Rome 110 CE); generalization to other periods unclear.
  • Triad Engine framework not included; only schema example provided, limiting immediate reproducibility of prompt grounding method.
Category
Target Audience

AI researchers, prompt engineers, computer vision teams

Similar To

Anthropic's Evals framework · OpenAI's benchmarks for prompt engineering quality

Post Description

We created 24 image prompts across 3 characters living in Rome, 110 CE. Each prompt has a naive version and a culturally grounded version enhanced by the Triad Engine (structured domain knowledge injection). Same model, same pipeline; only the prompt changes. A blinded Gemini Vision judge scores each pair without knowing which is which.
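The blinding step described above can be illustrated with a minimal sketch. This is not the repository's code: the names `BlindedPair`, `blind_pairs`, and `unblind`, the file paths, and the `"A"`/`"B"`/`"tie"` verdict format are all assumptions made for illustration. The idea is to randomize which image the judge sees as A or B, keep the answer key hidden, and map verdicts back to conditions afterwards.

```python
import random
from dataclasses import dataclass


@dataclass
class BlindedPair:
    """What the judge sees: two images with no condition labels."""
    pair_id: str
    image_a: str
    image_b: str


def blind_pairs(pairs, seed=0):
    """Randomize A/B order for each (pair_id, naive_path, grounded_path).

    Returns the blinded pairs plus a secret key mapping each pair_id to
    the condition behind "A" and "B". All names here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    blinded, key = [], {}
    for pair_id, naive, grounded in pairs:
        if rng.random() < 0.5:
            blinded.append(BlindedPair(pair_id, naive, grounded))
            key[pair_id] = {"A": "naive", "B": "grounded"}
        else:
            blinded.append(BlindedPair(pair_id, grounded, naive))
            key[pair_id] = {"A": "grounded", "B": "naive"}
    return blinded, key


def unblind(verdicts, key):
    """Map judge verdicts ("A", "B", or "tie") back to conditions."""
    return {
        pair_id: "tie" if verdict == "tie" else key[pair_id][verdict]
        for pair_id, verdict in verdicts.items()
    }


# Usage with placeholder paths (3 characters x 8 scenes = 24 pairs).
pairs = [(f"pair_{i:02d}", f"raw/{i:02d}.png", f"triad/{i:02d}.png")
         for i in range(24)]
blinded, key = blind_pairs(pairs, seed=42)
print(len(blinded))  # 24
```

Only `blinded` would be sent to the judge; `key` stays local until all verdicts are collected, so the judge cannot infer the condition from position.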

Results:

  • RAW (naive prompt): 12.5% historically accurate
  • TRIAD (grounded prompt): 83.3% historically accurate
  • In 23 of 24 pairs, the grounded image was judged more accurate
  • In 0 of 24 pairs was the naive image judged better

The key insight for prompt engineers: image models silently drop historical terms they don't recognize. "dextrarum iunctio handshake" produces nothing useful. "two men clasping right hands wrist-to-wrist, elbows raised" works. Visual translation, not historical terminology.
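As a sanity check on the arithmetic, the reported rates are consistent with 3 of 24 naive images and 20 of 24 grounded images being judged accurate. That breakdown is an inference from the percentages, not a figure stated in the post. The sketch below tallies rates and pairwise wins from placeholder data chosen to match the reported numbers; the helper names are hypothetical, and treating the one remaining pair as a tie is also an assumption.

```python
from collections import Counter


def accuracy_rate(flags):
    """Percentage of images flagged historically accurate."""
    return 100.0 * sum(flags) / len(flags)


def pairwise_wins(winners):
    """Count how often each condition won a blinded A/B comparison."""
    return Counter(winners)


# Placeholder data, NOT the benchmark's evaluation output.
# Counts are inferred from the reported 12.5% and 83.3% over 24 images.
raw_flags = [True] * 3 + [False] * 21
triad_flags = [True] * 20 + [False] * 4
# 23 grounded wins and 0 naive wins are reported; the 24th pair is
# assumed here to be a tie.
winners = ["grounded"] * 23 + ["tie"]

print(f"RAW:   {accuracy_rate(raw_flags):.1f}%")    # RAW:   12.5%
print(f"TRIAD: {accuracy_rate(triad_flags):.1f}%")  # TRIAD: 83.3%
print(pairwise_wins(winners)["grounded"])           # 23
```

Running the same tally on the repository's published evaluation data would be the way to confirm the actual per-image counts.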

The full benchmark — all 48 images, prompts, evaluation data, and reproducible pipeline — is open source. You can re-run the blinded evaluation yourself with a free Gemini API key.

Repo: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...

Paper: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...

Similar Projects

AI/ML · ●●● Banger

Legal RAG Bench

Legal RAG benchmark revealing embedding quality > LLM choice by 19-point margin.

Big Brain · Niche Gem · Solve My Problem
beowa
413mo ago
AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain
ryan4rtmx
2016d ago