Cheddar-bench – unsupervised benchmark for coding agents
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Reveals agents diagnose bottlenecks 87% correctly but fix them only 17%—scaffolding matters more than model.
ML engineers, inference platform developers, researchers studying agentic coding capability
HumanEval · MBPP · SWE-bench
The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.
What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.
The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 is the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.
We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.
The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/
Curious what others think about the scaffolding result in particular feels underexplored.
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.
Closes the loop: agents read eval traces to fix their own regressions.
Frontier models hit 67-75% outcome accuracy but only 25-42% on process compliance.
Persistent context layer beats Cursor's session amnesia on large codebases.
Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.
REPL with agentic LLM can inspect runtime and fix code—clever, but prototype-stage and insecure.