GitHub Repository

A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

2 stars · Python

I made a small helper for checking model-graded answers

by ML0037·Jun 14, 2026·2 points·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Dark Horse

Structurally verifies LLM judge reasoning instead of paying for a second model check.

Strengths
  • Structural verification flags ignored references without needing a second LLM judge call.
  • Cites specific research papers on judge bias to justify the design decisions clearly.
  • Local CLI viewer allows inspecting flagged runs immediately without cloud dashboard setup.
Weaknesses
  • Web dashboard mentioned in README appears unfinished compared to the local CLI viewer.
  • Requires modifying eval prompts to fit the claim-evidence structure, adding integration friction.
Category
Target Audience

AI researchers and ML engineers running LLM evaluations

Similar To

LangSmith · Arize Phoenix · DeepEval

Post Description

I made this while checking model-graded answers, and it helped me check the odd cases by hand. Not sure if it’s useful to anyone else.

TL;DR: it breaks an LLM judge run into claims -> evidence -> verdicts and flags any verdict that is not supported by the evidence, so I can check it manually.
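
To make the claims -> evidence -> verdicts idea concrete, here is a minimal sketch of what such a structural check could look like. It is not the project's actual API: the names `Claim`, `Verdict`, and `flag_unsupported` are hypothetical, and the check only looks at whether the links exist, not whether the evidence really backs the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    # Hypothetical record: one statement the judge made, plus the evidence it cited.
    claim_id: str
    text: str
    evidence_ids: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    # Hypothetical record: a judge decision and the claims it rests on.
    verdict_id: str
    label: str
    claim_ids: list[str] = field(default_factory=list)


def flag_unsupported(verdicts, claims, known_evidence_ids):
    """Return {verdict_id: [reasons]} for verdicts that lack structural support."""
    claims_by_id = {c.claim_id: c for c in claims}
    flagged = {}
    for v in verdicts:
        reasons = []
        if not v.claim_ids:
            reasons.append("verdict cites no claims")
        for cid in v.claim_ids:
            claim = claims_by_id.get(cid)
            if claim is None:
                reasons.append(f"unknown claim {cid}")
                continue
            # Keep only citations that point at evidence we actually recorded.
            valid = [e for e in claim.evidence_ids if e in known_evidence_ids]
            if not valid:
                reasons.append(f"claim {cid} has no known evidence")
        if reasons:
            flagged[v.verdict_id] = reasons
    return flagged


claims = [
    Claim("c1", "Answer matches the reference", ["e1"]),
    Claim("c2", "Answer is well formatted", []),
]
verdicts = [
    Verdict("v1", "pass", ["c1"]),
    Verdict("v2", "pass", ["c2"]),
]
print(flag_unsupported(verdicts, claims, {"e1"}))
# {'v2': ['claim c2 has no known evidence']}
```

Here `v2` gets flagged because the only claim behind it cites no evidence, which is the kind of odd case that would then be checked by hand. No second model call is involved; the check is plain bookkeeping over the recorded links.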

Similar Projects

AI/ML · ●● Solid

DocForge – Multi-Agent RAG That Fact-Checks Its Own Answers

Multi-agent fact-checking loop, but RAG hallucination fixes are table stakes now.

Big Brain · Ship It
toheed11
114mo ago
AI/ML · ●● Solid

LLM Debate Benchmark

Side-swapped debate matchups expose model weaknesses standard benchmarks miss.

Big Brain · Dark Horse
zone411
932mo ago