Coding agents find the right GPU bottleneck 70% of the time, fix it 30%

by ayushnangia16·Feb 26, 2026·3 points·1 comment

AI Analysis


Reveals that agents diagnose bottlenecks correctly 87% of the time but fix them only 17% of the time; scaffolding appears to matter at least as much as the model.

Strengths
  • Real-world task curation from merged PRs in production systems (vLLM, SGLang), not synthetic benchmarks
  • Dual-metric framework (hard execution + soft semantic) catches lucky wins that single-metric eval misses
  • Unexpected finding: agent scaffolding matters at least as much as model choice. Claude Code is best on vLLM but worst on SGLang
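
A minimal sketch of how a dual-metric readout like the one above could separate "lucky wins" from real fixes. Everything here is an assumption for illustration: the `Attempt` fields, the category names, and the sample records are hypothetical and are not taken from the benchmark's evaluation code or results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    task_id: str
    hard_pass: bool  # assumed meaning: patch runs and passes the execution check
    soft_pass: bool  # assumed meaning: patch targets the same bottleneck as the human fix


def classify(attempt: Attempt) -> str:
    """Place one attempt into one of four quadrants."""
    if attempt.hard_pass and attempt.soft_pass:
        return "true_success"
    if attempt.hard_pass:
        return "lucky_win"  # passes execution without addressing the real bottleneck
    if attempt.soft_pass:
        return "right_diagnosis_broken_fix"
    return "miss"


def summarize(attempts):
    """Count attempts per quadrant."""
    return Counter(classify(a) for a in attempts)


# Placeholder records, purely illustrative.
sample = [
    Attempt("task-1", hard_pass=True, soft_pass=True),
    Attempt("task-2", hard_pass=False, soft_pass=True),
    Attempt("task-3", hard_pass=True, soft_pass=False),
    Attempt("task-4", hard_pass=False, soft_pass=True),
]
print(summarize(sample))
```

A single execution-only metric would count `task-3` as a success; a single semantic metric would count `task-2` and `task-4` as successes. Reporting the quadrants keeps both failure modes visible.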
Weaknesses
  • Limited to two codebases; generalization to other inference engines or domains unclear
  • Open-source models score 0%—raises questions about practical utility for teams without Claude/GPT
Target Audience

ML engineers, inference platform developers, researchers studying agentic coding capability

Similar To

HumanEval · MBPP · SWE-bench

Post Description

One of the authors. Some things that surprised us while running these experiments:

The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.

What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.
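
To make the off-by-one case concrete, here is a minimal pure-Python sketch of the pattern. It is not taken from the benchmark or from any agent transcript; the function names are hypothetical, and the nested loops only stand in for the block/thread indexing of a GPU kernel launch.

```python
def num_blocks_buggy(n, block_size):
    # Floor division: silently drops a partial final block.
    return n // block_size


def num_blocks_fixed(n, block_size):
    # Ceiling division: the last, partially filled block is still launched.
    return (n + block_size - 1) // block_size


def blockwise_scale(values, block_size, factor, num_blocks):
    """Scale every element, processing the input one block at a time."""
    n = len(values)
    out = list(values)
    for block in range(num_blocks(n, block_size)):
        start = block * block_size
        # Bounds guard, the analogue of `if idx < n` inside a kernel.
        end = min(start + block_size, n)
        for i in range(start, end):
            out[i] = values[i] * factor
    return out


data = [1] * 10
print(blockwise_scale(data, 4, 2, num_blocks_buggy))  # last two elements left unscaled
print(blockwise_scale(data, 4, 2, num_blocks_fixed))  # all ten elements scaled
```

Both versions agree whenever the input length is a multiple of the block size, which is why a test on round-number shapes can pass while the buggy version still produces wrong results on other inputs.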

The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 is the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.

We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.

The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/

Curious what others think about the scaffolding result in particular; it feels underexplored.
