I made a small helper for checking model-graded answers
Flags LLM judge verdicts unsupported by evidence without needing a second model.
A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.
Flags unsupported LLM judge verdicts by tracing claims back to evidence.
AI researchers, ML engineers
LangSmith · Arize Phoenix · Ragas
TL;DR: it breaks an LLM judge run into claims -> evidence -> verdicts and flags any verdict that the evidence does not support, so I can check it manually.
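A minimal sketch of the claims -> evidence -> verdicts idea, to make the flagging step concrete. Everything here is an assumption for illustration: the `Claim` and `Verdict` classes, the `unsupported_verdicts` function, and the ID-based linking are hypothetical names, not the helper's actual API. The check is purely structural (does each verdict trace back to at least one known evidence item?), so it needs no second model, but it also does not judge whether the evidence really entails the claim. That part stays manual.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    # Hypothetical record: one statement the judge made, plus the evidence it cites.
    claim_id: str
    text: str
    evidence_ids: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    # Hypothetical record: one judge decision, plus the claims it rests on.
    verdict_id: str
    label: str
    claim_ids: list[str] = field(default_factory=list)


def unsupported_verdicts(evidence, claims, verdicts):
    """Return {verdict_id: [reasons]} for verdicts that cannot be traced to evidence.

    `evidence` is assumed to be a dict mapping evidence IDs to quoted text.
    """
    claims_by_id = {c.claim_id: c for c in claims}
    flagged = {}
    for v in verdicts:
        reasons = []
        if not v.claim_ids:
            reasons.append("verdict cites no claims")
        for cid in v.claim_ids:
            claim = claims_by_id.get(cid)
            if claim is None:
                reasons.append(f"unknown claim {cid}")
            elif not any(eid in evidence for eid in claim.evidence_ids):
                # The claim exists but none of its cited evidence IDs are recorded.
                reasons.append(f"claim {cid} has no known evidence")
        if reasons:
            flagged[v.verdict_id] = reasons
    return flagged


# Toy judge run: one verdict is backed by evidence, one is not.
evidence = {"e1": "The answer states that water boils at 100 C at sea level."}
claims = [
    Claim("c1", "The answer gives the correct boiling point.", ["e1"]),
    Claim("c2", "The answer cites a source.", []),
]
verdicts = [
    Verdict("v1", "correct", ["c1"]),
    Verdict("v2", "well-sourced", ["c2"]),
]
flagged = unsupported_verdicts(evidence, claims, verdicts)
print(flagged)
```

In this toy run only `v2` is flagged, because its single claim points at no recorded evidence. The flagged entries are the ones to review by hand.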
Cryptographic audit chain for agents, but lacks the observability dashboards that competing tools provide.
SHA-256 hash-chained AI audit log, but it has only 9 commits and a Ko-fi upsell.
Qualitative eval workflow for PMs, whereas LangSmith and Arize target ML engineers.
The repo nails the governance bits: MECE decomposition, a strict source-gate, and JSON patch specs so that changes are made only when verifiable full text exists. It emits true DOCX tracked edits and a Q→source audit mapping, exactly the kind of deterministic audit trail regulated teams want. However, the project is still early (few stars, light demos), and it is unclear how it integrates with verification or LLM orchestration out of the box.
Cryptographically signed test evidence for FDA and EU AI Act compliance is genuinely novel.