We achieved 72.2% issue resolution on SWE-bench Verified using AI teams

by NBenkovich·Feb 12, 2026·2 points·0 comments

AI Analysis

●● Solid · Wizardry · Big Brain

The Take

They split responsibilities across isolated agents (engineer, reviewer, manager) that get real shell access and independent filesystems, which makes failures traceable and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is an impressive empirical result. The architecture is interesting and the evidence is strong, though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.
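To make the role split concrete, here is a minimal Python sketch, not the project's actual implementation. The `AgentRole` class, the `make_team` and `run_shell` helpers, and the model tier labels ("large", "medium", "small") are all hypothetical names invented for illustration. A private temporary directory stands in for each agent's independent filesystem; a real deployment would presumably need containers or VMs, since a working directory alone is not a security boundary. It assumes a POSIX `sh` is available.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentRole:
    name: str      # e.g. "engineer", "reviewer", "manager"
    model: str     # hypothetical model tier label, not a real model name
    workdir: Path  # private directory standing in for an isolated filesystem


def make_team(model_by_role: dict[str, str]) -> dict[str, AgentRole]:
    """Create one role per entry, each with its own temporary directory."""
    return {
        name: AgentRole(name, model, Path(tempfile.mkdtemp(prefix=f"{name}-")))
        for name, model in model_by_role.items()
    }


def run_shell(role: AgentRole, command: str) -> subprocess.CompletedProcess:
    """Run a shell command inside the role's own directory and capture output."""
    return subprocess.run(
        ["sh", "-c", command],
        cwd=role.workdir,
        capture_output=True,
        text=True,
        timeout=30,
    )


# Model capacity is chosen per role (assumed tiers, for illustration only).
team = make_team({"engineer": "large", "reviewer": "medium", "manager": "small"})

# The engineer writes a file; it lands only in the engineer's directory.
result = run_shell(team["engineer"], "echo fix > patch.txt && ls")
```

Because every command is tied to one role and one directory, a failed step can be traced back to the agent that ran it, which is the traceability benefit the summary describes.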

Category

Developer Tools

Target Audience

Backend/frontend engineers, engineering managers, developer-tool builders, AI researchers

Similar Projects

AI/ML · ●●● Banger

97% on SWE-bench Verified with subscription-token agents

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Big Brain · Zero to One

kimjune01

2010d ago

Developer Tools · ●● Solid

Codex context bloat? 87% avg reduction on SWE-bench Verified traces

Transparent proxy cuts Codex context tokens by 87% via working memory.

Big Brain · Niche Gem

george_ciobanu

1021mo ago

AI/ML · ○ Pass

All the LM solutions on SWE-bench are bloated compared to humans

Twitter thread with a chart; not a product or tool.

lieret

103mo ago

AI/ML · ●●●● Gem

New Benchmark from SWE-bench team is 0% solved

Agents fail completely at rebuilding binaries from scratch without source code.

Big Brain · Bold Bet · Zero to One

lieret

24329d ago

Developer Tools · ● Mid

Salacia – The First Runtime OS for Agentic Coding

Fault-localization scaffolding for AI agents; claims 93% top-5 recall, but Cursor and Cline already integrate similar functionality.

Big Brain · Bold Bet

alfredhua

203mo ago

Security · ●●● Banger

ZeroID – Open-source identity for AI agents based on OIDF standards

RFC 8693 agent identity with delegation chains before standards even exist.

Zero to One · Bold Bet

jalbrethsen

751mo ago