GitHub Repository

A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

3 stars · Python

I built a small audit layer for LLM-as-judge decisions

by ML0037·Jun 26, 2026·2 points·0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

Flags unsupported LLM judge verdicts by tracing claims back to evidence.

Strengths
  • Breaks judge reasoning into explicit claim-evidence graphs.
  • Local viewer flags missing evidence or ignored rubrics.
  • Addresses known LLM judge bias without adding another model.
Weaknesses
  • Niche audience limits broader developer appeal.
  • Eval infrastructure space is getting crowded with established tools.
Category
Target Audience

AI researchers, ML engineers

Similar To

LangSmith · Arize Phoenix · Ragas

Post Description

I made this while checking model-graded answers, and it helped me check the odd cases by hand. Not sure if it’s useful to anyone else.

TL;DR: it breaks an LLM judge run into claims -> evidence -> verdicts and flags any verdict that is not supported by the evidence, so I can check it manually.
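
A minimal sketch of that claims -> evidence -> verdicts check, assuming a simple in-memory layout. The names here (`Evidence`, `Claim`, `Verdict`, `flag_unsupported`) are hypothetical and not taken from the repo; the real project may model and flag things differently.

```python
from dataclasses import dataclass, field


# Hypothetical record types for illustration only (not the repo's API).
@dataclass
class Evidence:
    id: str
    text: str


@dataclass
class Claim:
    id: str
    text: str
    evidence_ids: list = field(default_factory=list)


@dataclass
class Verdict:
    id: str
    label: str
    claim_ids: list = field(default_factory=list)


def flag_unsupported(verdicts, claims, evidence):
    """Return {verdict_id: [reasons]} for verdicts that need a manual look."""
    claim_by_id = {c.id: c for c in claims}
    known_evidence = {e.id for e in evidence}
    flags = {}
    for verdict in verdicts:
        reasons = []
        if not verdict.claim_ids:
            reasons.append("verdict cites no claims")
        for cid in verdict.claim_ids:
            claim = claim_by_id.get(cid)
            if claim is None:
                reasons.append(f"unknown claim {cid}")
                continue
            # A claim counts as supported only if it points at recorded evidence.
            if not any(eid in known_evidence for eid in claim.evidence_ids):
                reasons.append(f"claim {cid} has no recorded evidence")
        if reasons:
            flags[verdict.id] = reasons
    return flags


# Made-up example data: one supported verdict, one unsupported.
evidence = [Evidence("e1", "Answer states 2 + 2 = 4.")]
claims = [
    Claim("c1", "The arithmetic is correct.", ["e1"]),
    Claim("c2", "The answer cites its sources.", []),
]
verdicts = [
    Verdict("v1", "pass", ["c1"]),
    Verdict("v2", "fail", ["c2"]),
]

flags = flag_unsupported(verdicts, claims, evidence)
print(flags)
```

In this sketch only `v2` is flagged, because its single claim points at no evidence. The check is purely structural: it does not call a second model, it only reports verdicts whose claim-to-evidence links are missing so they can be reviewed by hand.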

Similar Projects

AI/ML · ●● Solid

I made a small helper for checking model-graded answers

Flags LLM judge verdicts unsupported by evidence without needing a second model.

Big Brain · Niche Gem
ML0037
107d ago

OpenRevise is the Harvey for all industries

The repo nails the governance bits: MECE decomposition, a strict source‑gate, and JSON patch specs so changes are only made when verifiable fulltext exists. It emits true DOCX tracked edits and a Q→source audit mapping — exactly the kind of deterministic audit trail regulated teams want — but the project is still early (few stars, light demos) and it’s unclear how it integrates with verification or LLM orchestration out of the box.

Niche Gem · Solve My Problem
alfredray
304mo ago