Cheddar-bench – unsupervised benchmark for coding agents
Unsupervised bug benchmark that uses agents as both attackers and defenders, with a novel scoring methodology.

Execution-based scoring with live APIs beats LLM-graded benchmarks, but the results are self-evaluated.
Engineering teams evaluating AI testing tools
Cognition Labs · AI2 · LangSmith
Tests live in README as plain English; clever partial parsing eliminates Gherkin boilerplate overhead.
AI finds 250 bugs in LiteLLM, LobeChat, but offers no demo or accessible entry point.
AI PR generation for typos and copy, but bug reporting tools already exist elsewhere.
Open-sourced agent that actually attempts to reproduce GitHub bugs in your CI.
Measures epistemic humility and paradox tolerance using AI-evaluated scenario responses.