Cheddar-bench – unsupervised benchmark for coding agents
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Tests agents on 700 policy docs and noisy voice calls where AgentBench stops.
AI researchers, LLM developers, Enterprise AI teams
AgentBench · GAIA · LiveBench
τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. Best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval — it's reasoning over complex, interlinked policies and executing the right actions in the right order.
τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio — accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): agent mishears a name or email during authentication, and everything downstream fails.
We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.
Code and leaderboard are open — we'd welcome community submissions and feedback.
Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.
Agents fail completely at rebuilding binaries from scratch without source code.
First benchmark for physical-world AI when MMLU only tests textbook knowledge.
Agent loop proofreading evals where HELM and LMSys are too generic.
Using 1980s Rogue as an LLM benchmark is genuinely novel and technically clever.
Rust swarm vs LLM agents is clever positioning, but benchmarks are self-designed and lack third-party validation.