GitHub Repository

A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

3 stars · Python

I built a small audit layer for LLM-as-judge decisions


by ML0037·Jun 26, 2026·2 points·0 comments


AI Analysis

●● Solid · Niche Gem · Big Brain

Flags unsupported LLM judge verdicts by tracing claims back to evidence.

Strengths

• Breaks judge reasoning into explicit claim-evidence graphs.
• Local viewer flags missing evidence or ignored rubrics.
• Addresses known LLM judge bias without adding another model.

Weaknesses

• Niche audience limits broader developer appeal.
• Eval infrastructure space is getting crowded with established tools.

Post Description

I made this while checking model-graded answers, and it helped me check the odd cases by hand. Not sure if it’s useful to anyone else.

TL;DR: it breaks an LLM judge run into claims -> evidence -> verdicts and flags any verdict that is not supported by the evidence, so I can check it manually.
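A minimal sketch of the claims -> evidence -> verdicts idea described in the post, assuming a simple in-memory structure. The names `Claim`, `Verdict`, and `unsupported_claims` are hypothetical and are not taken from the project's code; the real tool may model a judge run quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    # One statement the judge makes, plus the ids of the evidence it cites.
    text: str
    evidence_ids: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    # The judge's final label and the claims it rests on.
    label: str
    claims: list[Claim]


def unsupported_claims(verdict: Verdict, evidence: dict[str, str]) -> list[Claim]:
    """Return claims that cite no evidence id present in the evidence record."""
    flagged = []
    for claim in verdict.claims:
        cited = [eid for eid in claim.evidence_ids if eid in evidence]
        if not cited:
            flagged.append(claim)
    return flagged


# Hypothetical judge run: one claim is backed by evidence, one is not.
evidence = {"e1": "The answer states the capital of France is Paris."}
verdict = Verdict(
    label="pass",
    claims=[
        Claim("Answer names the correct capital.", ["e1"]),
        Claim("Answer cites a source.", []),
    ],
)

flagged = unsupported_claims(verdict, evidence)
for claim in flagged:
    print(f"needs manual review: {claim.text}")
```

In this sketch a verdict counts as suspect as soon as any of its claims lacks traceable evidence, which matches the post's stated goal of surfacing cases for a human to check rather than deciding them automatically.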

Similar Projects

AI/ML · ●● Solid

I made a small helper for checking model-graded answers

Flags LLM judge verdicts unsupported by evidence without needing a second model.

Big Brain · Niche Gem

ML0037

107d ago

Infrastructure · ●● Solid

Air – Open-source black box for AI agents (tamper-evident audit trails)

Cryptographic audit chain for agents, but lacks observability dashboards competing tools provide.

Big Brain · Wizardry

shotwellj

214mo ago

Security · ●● Solid

CEL v0.2 Pro – cryptographic black box recorder for AI systems (Python)

SHA-256 hash-chained AI audit log, but only 9 commits and ko-fi upsell.

Big Brain · Bold Bet

GhurtSky

103mo ago

AI/ML · ●● Solid

GEDD – A Systematic Evidence Driven LLM as a Judge Framework

Qualitative eval workflow for PMs when LangSmith and Arize target ML engineers.

Big Brain · Niche Gem

balasvce2026

2012d ago

Open Source · ● Mid

OpenRevise is the Harvey for all industries

The repo nails the governance bits: MECE decomposition, a strict source‑gate, and JSON patch specs so changes are only made when verifiable fulltext exists. It emits true DOCX tracked edits and a Q→source audit mapping — exactly the kind of deterministic audit trail regulated teams want — but the project is still early (few stars, light demos) and it’s unclear how it integrates with verification or LLM orchestration out of the box.

Niche Gem · Solve My Problem

alfredray

304mo ago

AI/ML · ●●● Banger

AgentCarousel – behavioral tests for AI agents, with signed evidence

Cryptographically signed test evidence for FDA and EU AI Act compliance is genuinely novel.

Big Brain · Zero to One

neemsio

2015d ago