We proved you can't train hallucinations out of AI – so we verify

by assaydev·Feb 18, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Zero to One

Training can't fix hallucinations—verify externally instead. Backed by surprising RLVF data.

Strengths
  • Rigorous empirical finding (RLVF collapse with more training data) reframes the whole problem
  • Claim extraction + adversarial verification is genuinely novel—not another LLM-as-judge scorer
  • Real validation against test suites instead of circular LLM evaluation
Weaknesses
  • GitHub 404 page shown; unclear if repo is live or project is abandoned
  • Needs live demo or real integration examples to prove effectiveness beyond the thesis
Target Audience

Engineers shipping AI-assisted code who want hallucination detection

Similar To

Copilot guardrails · Code2Prompt verification modes

Post Description

Hi HN, I'm Ty. I built Assay because I got tired of shipping bugs that AI hallucinated into my code and no tool caught.

The starting point was a finding that surprised me: when we tried training verification directly into models using RLVF (Reinforcement Learning from Verification Feedback), more training data made the model worse. 120 curated pairs hit 91.5% accuracy. 2,000 pairs collapsed to 77.4%. The model's training loss kept decreasing while eval performance cratered. This isn't a tuning problem. Verification cannot be internalized.

So we built an external layer. Assay extracts the implicit claims code makes ("this handles null input," "this query is injection-safe," "this validates auth tokens") and verifies each one against the actual implementation. It's not a linter, not another LLM-as-judge — it's structured claim extraction followed by adversarial verification.
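To make the extract-then-verify idea concrete, here is a minimal sketch. It is not Assay's implementation: the `Claim`, `Verdict`, and `verify` names, the toy `parse_port` function, and the hand-written probes are all hypothetical, and the claims are listed by hand rather than extracted automatically. It only illustrates the shape of the approach: state each implicit claim explicitly, then try to break it by running the actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    """One implicit claim the code makes, paired with a probe that tries to break it."""
    text: str
    probe: Callable[[], None]  # raises an exception if the claim does not hold


@dataclass
class Verdict:
    claim: str
    holds: bool
    evidence: str


def parse_port(value):
    """Toy implementation under test (hypothetical): assumes `value` is a string."""
    return int(value.strip())


def verify(claims: List[Claim]) -> List[Verdict]:
    """Run every probe against the real implementation and record the outcome."""
    verdicts = []
    for claim in claims:
        try:
            claim.probe()
            verdicts.append(Verdict(claim.text, True, "probe passed"))
        except Exception as exc:  # any failure is evidence against the claim
            verdicts.append(Verdict(claim.text, False, f"{type(exc).__name__}: {exc}"))
    return verdicts


def probe_numeric() -> None:
    assert parse_port(" 8080 ") == 8080


def probe_null() -> None:
    parse_port(None)  # adversarial input: None has no .strip()


# Hand-written stand-ins for extracted claims (an assumption of this sketch).
claims = [
    Claim("parses a padded numeric string", probe_numeric),
    Claim("handles null input", probe_null),
]

verdicts = verify(claims)
for v in verdicts:
    status = "HOLDS" if v.holds else "FAILS"
    print(f"{status}: {v.claim} ({v.evidence})")
```

The point of the design is that the verdict comes from executing the implementation, not from asking a model whether the claim sounds plausible.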

Results validated against real test suites (not LLM judgment):

- HumanEval: 100% pass@5 (164/164); baseline was 86.6%
- SWE-bench: 30.3% (91/300) vs 18.3% baseline, a +65.5% relative improvement
- LVR pilot: Found 23 real bugs (2 critical) in a production ERP system, verified 354 claims
- LLM-as-judge actually regresses at k=5 (97.2% vs our 100%) because it hallucinates false positives
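For readers who want to check the SWE-bench arithmetic, here is a small worked sketch. The baseline count is an assumption: the post gives only 18.3%, and if the baseline was measured on the same 300 tasks, 55/300 is the only count that rounds to that figure.

```python
def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


swe_assay = pct(91, 300)     # reported in the post as 30.3%
# ASSUMPTION: 55 solved tasks, inferred from the reported 18.3% baseline.
swe_baseline = pct(55, 300)

absolute_gain = swe_assay - swe_baseline   # in percentage points
relative_gain = 100.0 * (91 - 55) / 55     # in percent over the baseline

print(f"SWE-bench: {swe_assay:.1f}% vs {swe_baseline:.1f}% baseline")
print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

Under that assumption the +65.5% is a relative gain over the baseline; the absolute gain is about 12 percentage points.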

Ships as a GitHub Action for PR verification, or try it: npx tryassay assess /path/to/your/project

Public repo (the URL above points to our private research repo): https://github.com/gtsbahamas/assay-verify

GitHub Action: uses: gtsbahamas/assay-verify/github-action@main

Paper: https://doi.org/10.5281/zenodo.18522644

Similar Projects