97% on SWE-bench Verified with subscription-token agents
97% on SWE-bench Verified with full artifact transparency, not just a score claim.

The approach splits responsibilities across isolated agents (engineer, reviewer, manager), each with real shell access and its own independent filesystem, which makes failures traceable and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is a strong empirical result for the architecture, though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.
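As a rough illustration of the role split described above, here is a minimal sketch, not the project's actual implementation. The `RoleAgent` class, the model tier labels, and the temp-directory workspaces are all assumptions made for this example. A separate working directory only approximates filesystem isolation; a real setup would likely need containers or VMs.

```python
import subprocess
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RoleAgent:
    """One agent role with its own workspace directory (hypothetical structure)."""
    role: str
    model: str  # illustrative capacity tier, not a real model name
    workspace: Path = field(
        default_factory=lambda: Path(tempfile.mkdtemp(prefix="agent-"))
    )
    log: list = field(default_factory=list)

    def run(self, command: str) -> str:
        """Run a shell command inside this agent's workspace and record it."""
        result = subprocess.run(
            command,
            shell=True,
            cwd=self.workspace,
            capture_output=True,
            text=True,
            timeout=30,
        )
        # The per-agent log is what makes a failure traceable to one role.
        self.log.append((command, result.returncode))
        return result.stdout


# Model capacity is chosen per role; these tiers are placeholders.
agents = {
    "engineer": RoleAgent("engineer", model="large"),
    "reviewer": RoleAgent("reviewer", model="medium"),
    "manager": RoleAgent("manager", model="small"),
}

# The engineer writes a file; the reviewer's workspace stays untouched.
agents["engineer"].run("echo patch > fix.diff")
print(sorted(p.name for p in agents["engineer"].workspace.iterdir()))
print(sorted(p.name for p in agents["reviewer"].workspace.iterdir()))
```

Keeping a command log per agent means a bad patch or failed command can be attributed to a specific role, and the `model` field is where a per-role capacity choice would plug in.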
Backend/frontend engineers, engineering managers, developer-tool builders, AI researchers
Transparent proxy cuts Codex context tokens by 87% via working memory.
Twitter thread with a chart; not a product or tool.
Agents fail completely at rebuilding binaries from scratch without source code.
Fault-localization scaffolding for AI agents; claims 93% top-5 recall, but Cursor and Cline already integrate similar functionality.
Agent identity built on RFC 8693 token exchange with delegation chains, before agent-identity standards even exist.