AgentToolBench-Code – security benchmark for AI coding agents
Expands corpus to 16 CVE-anchored scenarios to break model ties.
First systematic attack framework for AI agents with shell access; 7 of 9 exploits succeeded.
AI/ML engineers, security researchers, developers building autonomous agents with code execution
OWASP LLM Top 10 · Prompt Injection benchmarks (HuggingFace, Anthropic's red-teaming) · Container escape test suites
Agent red-teaming via UI, but the attack catalog is shallow and the comparison with manual testing is unclear.
The author walks the reader through a full exploit chain that starts with a UX/trust-boundary trick and ends in RCE by causing a client to connect to an attacker gateway, leak a token, and reconfigure the agent’s execution environment. It's a sharp systems narrative that will change how you think about agents crossing chat, browser, and local tooling. Excellent reading for defenders and attacker-minded engineers, but it is an investigative article rather than a shippable tool.
Dead-code finder benchmarked across FastAPI, Pydantic, and Flask, but Vulture and Bandit already solve this.
Scanner benchmarking for DAST tools. DVWA and Juice Shop dominate security training.
White-box agent red teaming finds 5x more vulns than black-box prompt injection.