SurvivalIndex – which developer tools do AI agents choose?

by scalefirst · Mar 7, 2026 · 1 point · 3 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Measures what agents *actually pick*, not just what they are capable of, and reveals tool blindness: tools Claude can use but overlooks.

Strengths
  • Novel failure mode detection: agents can use tools but don't reach for them, a question orthogonal to what BFCL benchmarks measure.
  • Structured methodology with a human coefficient variable and transparent AAS scoring; a methodology page is provided.
  • Genuine empirical data from running agents against standardized repos with natural-language prompts and no priming.
Weaknesses
  • Only 33 tools tracked and only 6 marked as 'hidden gems'; the sample is too small to support claims about survivorship patterns.
  • No evidence agents were tested across recent tool versions; methodology doesn't specify Claude version dates.
  • Leaderboard lacks confidence intervals or statistical significance; human rater agreement/disagreement not shown.
Target Audience

Tool builders, AI researchers, developers curious about agent behavior

Similar To

BFCL (Berkeley Function Calling Leaderboard) · Chatbot Arena

Post Description

We've been running coding agents against standardized repos with natural-language prompts — no tool names, no hints — and measuring what they actually choose.

Early finding: Claude Code picks Custom/DIY in 12 of 20 categories. Not because it can't use the tools (BFCL scores suggest it can) but because it doesn't reach for them. That's a different failure mode than capability benchmarks measure.
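
For a sense of how much uncertainty sits behind a count like 12 of 20, here is a minimal sketch that computes the pick rate and a 95% Wilson score interval. It assumes each category can be treated as an independent yes/no trial, which the post does not state; the function name `wilson_interval` is illustrative and not part of SurvivalIndex.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # Center of the interval is pulled toward 0.5 for small samples.
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Figures from the post: Custom/DIY picked in 12 of 20 categories.
diy_picks, categories = 12, 20
low, high = wilson_interval(diy_picks, categories)
print(f"Custom/DIY pick rate: {diy_picks / categories:.0%} "
      f"(95% CI {low:.0%} to {high:.0%})")
```

Under that independence assumption the interval runs from roughly 39% to 78%, so with 20 categories the 60% figure is compatible with both a minority and a strong majority of Custom/DIY picks.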

We score each tool on five signals: agent visibility, pick rate vs Custom/DIY, cross-context breadth, expert human ratings, and implementation success rate. Tools with a survival score above 1 persist; below it, agents synthesize around them.
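
The actual formula lives on the methodology page and is not reproduced here. Purely as an illustration of how five signals could collapse into one number with a threshold at 1, here is a hypothetical sketch: the field names, the 0-to-1 scaling, and the combination rule (pick ratio times a geometric mean of the other signals) are all assumptions, not the SurvivalIndex scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSignals:
    """Hypothetical inputs; names and scales are assumptions for illustration."""
    visibility: float    # share of runs where the agent surfaces the tool, 0..1
    pick_ratio: float    # tool picks divided by Custom/DIY picks, >= 0
    breadth: float       # share of contexts where the tool gets picked, 0..1
    human_rating: float  # expert rating rescaled to 0..1
    success_rate: float  # share of implementations that succeed, 0..1


def survival_score(s: ToolSignals) -> float:
    """Assumed rule: pick ratio scaled by the geometric mean of the other signals."""
    quality = (s.visibility * s.breadth * s.human_rating * s.success_rate) ** 0.25
    return s.pick_ratio * quality


def classify(s: ToolSignals, threshold: float = 1.0) -> str:
    """Apply the 'survival = 1' cutoff described in the post."""
    return "persists" if survival_score(s) > threshold else "synthesized around"


# Made-up example values, not measured data.
favored = ToolSignals(visibility=0.9, pick_ratio=2.0, breadth=0.8,
                      human_rating=0.9, success_rate=0.95)
ignored = ToolSignals(visibility=0.4, pick_ratio=0.5, breadth=0.3,
                      human_rating=0.8, success_rate=0.9)
print(classify(favored), "/", classify(ignored))
```

With this assumed rule a tool can only clear the threshold if agents pick it more often than they build a Custom/DIY alternative, since the quality factor never exceeds 1. A different weighting would move the cutoff.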

Methodology is at survivalindex.org/methodology. Very curious what people think of the measurement approach, especially the human coefficient variable.

Similar Projects

Developer Tools · ●● Solid

Ambits – Claude Code agent coverage tooling

Tails Claude Code's JSONL and paints every function/struct/class by read-depth (unseen → name-only → full body) in a live terminal tree — plus automatic staleness marking when files change. The multi-agent tracking and optional Serena LSP backend are smart touches that make this more than a neat demo: it's practical observability for agent-driven workflows, though it's tightly coupled to Claude/Serena ecosystems.

Niche Gem · Wizardry
joshLong145
103mo ago

Codingagents.md – The open directory for AI coding agents

This is the kind of curated index I wish existed yesterday: agent pages, config format examples, SDK links and two named protocols (MCP/ACP) all collected in one place, plus a weekly-ranked table of models with context-length notes. It feels like real curation rather than linkspam, but the site leans on lists and scores — show the benchmark methodology, reproducible tests or interactive demos and the rankings would become trustable rather than just convenient.

Solve My Problem · Niche Gem
meame2010
544mo ago
AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain
ryan4rtmx
2017d ago