Back to browse
GitHub Repository

Multi-tier benchmark: Cultural grounding + Triad Engine eliminates LLM hallucination across Claude 4.6, GPT-5.2, Mistral 7B, Gemini 2.5 Pro. Raw 15-58% → 95-100% accuracy on 222 adversarial QA pairs (Ancient Rome 110 CE). Novel topological paradox detection (F1=0.939, zero-shot). Model-agnostic, in production.

6 starsPython

100% LLM accuracy–no fine-tuning, JSON only

by MysticBirdie·Feb 25, 2026·2 points·2 comments

AI Analysis

MidBig Brain

Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.

Strengths
  • Rigorous benchmark methodology: dual independent judges, 222 adversarial questions across 5 categories
  • Zero regressions: claims no case where grounding degrades correct answers
  • Model-agnostic approach works across Mistral, GPT, Claude, Gemini with single JSON layer
Weaknesses
  • Domain guide is opaque: repository shows results but not the actual Rome grounding JSON
  • Benchmark is narrow (Ancient Rome only): generalization to other domains unproven; adversarial robustness vs well-crafted jailbreaks untested
Category
Target Audience

AI researchers, prompt engineers, teams building domain-specific LLM systems

Similar To

Langchain prompt engineering · Semantic kernel · RAG systems with domain grounding

Post Description

# Show HN: 100% LLM accuracy—no fine-tuning, JSON only

*GitHub:* https://github.com/Mysticbirdie/hallucination-elimination-be... *Paper:* https://github.com/Mysticbirdie/hallucination-elimination-be...

---

*Model-agnostic Triad Engine*: JSON domain guide → 100% accuracy across Mistral 7B/Claude/GPT—no fine-tuning.

---

## Results (222 questions, adversarial domain, Gemini 2.0 Flash judge)

| Model | Raw | + Triad | ∆ | |-------|-----|---------|---| | Mistral 7B (local) | 22.5% | *99.5%* | +77pp | | Bielik 11B (local) | 21.6% | *88.7%* | +67.1pp | | GPT-5.2 | 26.1% | *100%* | +73.9pp | | Gemini 2.5 Pro | 42.3% | *95%* | +52.7pp | | Claude 4.6 | 45.0% | *100%* | +55pp | | Perplexity Sonar (RAG) | 64.4% | *93.7%* | +29.3pp |

Perplexity has live web search—Triad still adds 29.3pp. Gemini 2.0 judge (cross-model). Claude Opus (stricter): raw Claude 14.9% → Triad 95.9%. Zero regressions.

*Beyond accuracy:* - Adversarial: Raw Claude accepts false premises 25% → Triad 5% - Consistency: Raw Claude 0% agreement across personas → Triad 100% - Concise: 2.1× shorter responses (473 vs 1,015 chars)

*Real-world (Windsurf live codebase):* | Phase | Context | Score | |-------|---------|-------| | No context | — | 40% | | Unstructured docs | — | 40% | | Triad JSON guide | — | *100%* |

---

## Triad Engine Multi-voice layer above any LLM: - λ: Character voice from domain guide - μ: Truth/false enforcement - ν: User calibration - ω: Voice compositor

*Only input*: JSON domain guide (what exists/doesn't, agents, norms). No weight changes.

Works with any LLM. Applies to medical, legal, compliance domains.

---

## Open source 222 questions, runners (Claude/GPT/Gemini/Mistral), JSON results, guide schema in repo.

Happy to answer technical questions.

Similar Projects