GitHub Repository

A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

3 stars · Python

I made a small helper for checking model-graded answers

by ML0037·Jun 18, 2026·1 point·0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Flags LLM judge verdicts unsupported by evidence without needing a second model.

Strengths
  • Breaks judge runs into claims-evidence-verdicts chains for manual inspection.
  • Detects position bias, verbosity bias, and rubric coverage gaps automatically.
  • Local CLI viewer flags problematic verdicts without adding inference costs.
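As a rough illustration of how position and verbosity bias could be estimated from already-logged pairwise verdicts with no extra model calls, here is a minimal sketch. It is not taken from the project: the record keys (`winner_position`, `len_winner`, `len_loser`) and both function names are hypothetical, and it assumes answers were shown in randomized order, so rates far from 0.5 only hint at bias.

```python
from collections import Counter


def position_win_rates(records):
    """Share of pairwise verdicts won by the first-shown vs second-shown answer.

    Each record is a dict with a hypothetical "winner_position" key
    holding "first" or "second".
    """
    counts = Counter(r["winner_position"] for r in records)
    total = counts["first"] + counts["second"]
    if total == 0:
        return {"first": 0.0, "second": 0.0}
    return {"first": counts["first"] / total, "second": counts["second"] / total}


def longer_answer_win_rate(records):
    """Share of verdicts where the winning answer was also the longer one.

    Ties in length are ignored because they carry no verbosity signal.
    """
    decided = [r for r in records if r["len_winner"] != r["len_loser"]]
    if not decided:
        return 0.0
    return sum(r["len_winner"] > r["len_loser"] for r in decided) / len(decided)


# Made-up records, for illustration only.
records = [
    {"winner_position": "first", "len_winner": 120, "len_loser": 80},
    {"winner_position": "first", "len_winner": 200, "len_loser": 90},
    {"winner_position": "first", "len_winner": 60, "len_loser": 150},
    {"winner_position": "second", "len_winner": 300, "len_loser": 100},
]
rates = position_win_rates(records)
verbosity = longer_answer_win_rate(records)
print(rates, verbosity)
```

With real data, a proper check would also need enough samples for the rates to be meaningful; four records are used here only to keep the example readable.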
Weaknesses
  • Niche audience limits adoption to researchers doing serious LLM eval work.
  • Web dashboard mentioned but not yet implemented in the current release.
Category
Target Audience

AI researchers and engineers running LLM evaluation pipelines

Similar To

Arize Phoenix · LangSmith · Braintrust

Post Description

I made this while checking model-graded answers, and it helped me check the odd cases by hand. Not sure if it’s useful to anyone else.

TL;DR: it breaks an LLM judge run into claims -> evidence -> verdicts and flags any verdict that the evidence does not support, so I can check it manually.
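To make the claims -> evidence -> verdicts idea concrete, here is a minimal sketch of what such a chain and an "unsupported verdict" check might look like. It is an assumption-heavy illustration, not the project's actual code: the `Claim` and `Verdict` classes, the function names, and the rule that evidence must appear verbatim in the graded answer are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One statement the judge made while grading (hypothetical structure)."""
    text: str
    evidence: list[str] = field(default_factory=list)  # quoted spans backing the claim


@dataclass
class Verdict:
    """A judge decision plus the claims it says it relied on."""
    label: str
    claims: list[Claim]


def unsupported_claims(verdict: Verdict, source: str) -> list[Claim]:
    """Return claims with no evidence, or whose evidence is not found in the source."""
    flagged = []
    for claim in verdict.claims:
        # Assumed rule: a claim is supported only if it cites at least one span
        # and every cited span appears verbatim in the graded answer.
        supported = bool(claim.evidence) and all(span in source for span in claim.evidence)
        if not supported:
            flagged.append(claim)
    return flagged


def needs_manual_review(verdict: Verdict, source: str) -> bool:
    """Flag verdicts that rest on no claims at all or on any unsupported claim."""
    return not verdict.claims or bool(unsupported_claims(verdict, source))


# Made-up example, for illustration only.
answer = "The capital of France is Paris. It lies on the Seine."
verdict = Verdict(
    label="correct",
    claims=[
        Claim("Names the right capital", ["capital of France is Paris"]),
        Claim("Mentions the population", []),  # no evidence cited
    ],
)
flagged = unsupported_claims(verdict, answer)
print([c.text for c in flagged])
```

Exact substring matching is the simplest possible support test; it needs no second model, but it will miss paraphrased evidence, which is one reason flagged cases still go to a human.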

Similar Projects

AI/ML · ●●● Banger

I made a small helper for checking model-graded answers

Structurally verifies LLM judge reasoning instead of paying for a second model check.

Big Brain · Solve My Problem · Dark Horse
ML0037
204d ago
AI/ML · ●● Solid

DocForge – Multi-Agent RAG That Fact-Checks Its Own Answers

Multi-agent fact-checking loop, but RAG hallucination fixes are table stakes now.

Big Brain · Ship It
toheed11
114mo ago