Coding agents find the right GPU bottleneck 70% of the time, fix it 30%


by ayushnangia16 · Feb 26, 2026 · 3 points · 1 comment


AI Analysis


Reveals that agents diagnose bottlenecks correctly 87% of the time but fix them only 17% of the time; scaffolding matters more than model.

Strengths

• Real-world task curation from merged PRs in production systems (vLLM, SGLang), not synthetic benchmarks
• Dual-metric framework (hard execution + soft semantic) catches lucky wins that single-metric eval misses
• Unexpected finding: agent scaffolding outweighs model choice. Claude Code is best on vLLM but worst on SGLang
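To make the dual-metric idea concrete, here is a minimal sketch of how a hard (execution) result and a soft (semantic) result could be crossed to separate lucky wins from real ones. This is not the benchmark's evaluation code: the RunResult record, its field names, and the four labels are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunResult:
    """One agent attempt on one task (hypothetical record layout)."""
    task_id: str
    hard_pass: bool  # assumed meaning: patch ran correctly and delivered the speedup
    soft_pass: bool  # assumed meaning: patch targets the same bottleneck as the human fix


def classify(result: RunResult) -> str:
    """Map the two boolean metrics onto four outcome labels."""
    if result.hard_pass and result.soft_pass:
        return "true_win"
    if result.hard_pass:
        # Faster, but not for the reason the human fix was faster.
        return "lucky_win"
    if result.soft_pass:
        # Right diagnosis, but the implementation did not hold up.
        return "right_idea_broken_code"
    return "miss"


def summarize(results: List[RunResult]) -> Counter:
    """Count how many attempts fall into each outcome label."""
    return Counter(classify(r) for r in results)
```

A single hard metric would merge "true_win" and "lucky_win"; a single soft metric would merge "true_win" and "right_idea_broken_code". Keeping both is what exposes the gap between diagnosing and fixing.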

Weaknesses

• Limited to two codebases; generalization to other inference engines or domains unclear
• Open-source models score 0%, which raises questions about practical utility for teams without Claude/GPT

Post Description

One of the authors. Some things that surprised us while running these experiments:

The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.

What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs: off-by-one errors in kernel indexing, wrong tensor shapes, missing synchronization barriers. It's the kind of stuff that passes a code review at first glance but segfaults under load.
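As an illustration of how an off-by-one in kernel indexing can look fine at first glance, here is a small pure-Python simulation of a 1D launch-grid calculation. It is not taken from any task or transcript in the benchmark; the function names are made up, and the loop only mimics how blocks and threads map to element indices.

```python
def launch_grid_floor(n: int, block: int) -> int:
    """Buggy grid size: floor division drops the final partial block."""
    return n // block


def launch_grid_ceil(n: int, block: int) -> int:
    """Correct grid size: ceiling division covers the tail elements."""
    return (n + block - 1) // block


def covered_elements(n: int, block: int, num_blocks: int) -> int:
    """Count elements touched when every simulated thread guards with idx < n."""
    touched = 0
    for block_idx in range(num_blocks):
        for thread_idx in range(block):
            idx = block_idx * block + thread_idx
            if idx < n:  # bounds guard, as a real kernel would need
                touched += 1
    return touched
```

With n = 1024 and block = 256 both grid sizes cover every element, so a test on aligned shapes passes. With n = 1000 the floor version launches 3 blocks and silently skips the last 232 elements, which is the sort of bug that only shows up on real workloads.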

The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 shows the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.

We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.
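A harness could catch this kind of stall early instead of letting it run for thousands of turns. The sketch below is one possible guard, not something the benchmark is known to use: the Turn record, the find_stall helper, and the max_repeats threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One model turn in a transcript (hypothetical record layout)."""
    text: str
    made_tool_call: bool


def find_stall(turns: List[Turn], max_repeats: int = 3) -> Optional[int]:
    """Return the index of the turn where the agent counts as stuck, else None.

    "Stuck" here means max_repeats consecutive turns with identical text
    (ignoring case and whitespace) and no tool call in between.
    """
    streak = 0
    previous = None
    for i, turn in enumerate(turns):
        if turn.made_tool_call:
            # Any tool call counts as progress and resets the streak.
            streak, previous = 0, None
            continue
        normalized = " ".join(turn.text.lower().split())
        if normalized == previous:
            streak += 1
        else:
            streak, previous = 1, normalized
        if streak >= max_repeats:
            return i
    return None
```

The threshold is a judgment call: too low and the guard interrupts an agent that is legitimately restating a plan, too high and it burns tokens. Exact-match comparison is also the simplest choice; a real guard might need fuzzier matching.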

The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/

Curious what others think about the scaffolding result in particular; it feels underexplored.