Black-box API bug detection across 7 AI systems

by riyajoshi·Jun 4, 2026·11 points·4 comments

AI Analysis

● Mid · Slick

Execution-based scoring against live APIs beats LLM-graded benchmarks, but KushoAI built the benchmark and evaluated itself on it.

Strengths

•Execution-based scoring with live APIs and planted bugs is verifiable, not subjective.
•Tests three complexity tiers across seven application domains with repeated runs.
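
To make "execution-based scoring with planted bugs" concrete, here is a minimal sketch of how such a score could be computed. It is not the benchmark's actual harness: the in-process handlers stand in for live APIs, and every name (`reference_api`, `PLANTED`, `detection_rate`, `mean_rate`) is hypothetical. The assumption is that a planted bug counts as detected when a test that passes on the reference behavior fails on the buggy variant, and that scores are averaged over repeated runs.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-ins for a live API (assumption: the real benchmark
# calls HTTP endpoints; plain functions keep this sketch self-contained).
Handler = Callable[[dict], dict]
Test = Callable[[Handler], bool]  # returns True when the test passes


def reference_api(req: dict) -> dict:
    """Correct behavior: reject non-positive quantities, else price the order."""
    qty = req.get("qty", 0)
    if qty <= 0:
        return {"status": 400}
    return {"status": 200, "total": qty * req["price"]}


def bug_accepts_negative_qty(req: dict) -> dict:
    """Planted bug: input validation is skipped."""
    return {"status": 200, "total": req.get("qty", 0) * req["price"]}


def bug_wrong_total(req: dict) -> dict:
    """Planted bug: successful orders are overcharged by one."""
    out = reference_api(req)
    if out["status"] == 200:
        out["total"] += 1
    return out


PLANTED: Dict[str, Handler] = {
    "negative_qty": bug_accepts_negative_qty,
    "wrong_total": bug_wrong_total,
}


def detected(tests: List[Test], buggy: Handler) -> bool:
    # Only tests that pass on the reference count, which filters false alarms.
    valid = [t for t in tests if t(reference_api)]
    return any(not t(buggy) for t in valid)


def detection_rate(tests: List[Test]) -> float:
    """Fraction of planted bugs caught by one run's test suite."""
    return sum(detected(tests, b) for b in PLANTED.values()) / len(PLANTED)


def mean_rate(runs: List[List[Test]]) -> float:
    """Average detection rate over repeated runs of the same tool."""
    return mean(detection_rate(r) for r in runs)
```

Because the score comes from executing tests rather than from a model's opinion of them, anyone with the planted-bug list can recompute it. A suite that only checks input validation would catch one of the two bugs above, while a suite that also checks totals would catch both.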

Weaknesses

•KushoAI created the benchmark and evaluated themselves — obvious conflict of interest.
•The AI agent benchmark space is already crowded with Cognition, AI2, and many others.

Category

Target Audience

Engineering teams evaluating AI testing tools

Similar To

Cognition Labs · AI2 · LangSmith

Similar Projects

Developer Tools · ●●● Banger

Cheddar-bench – unsupervised benchmark for coding agents

Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Big Brain · Wizardry · Ship It

przadka

904mo ago

Developer Tools · ●● Solid

Bbt – Black Box Testing Directly from Your Documentation

Tests live in README as plain English; clever partial parsing eliminates Gherkin boilerplate overhead.

Big Brain · Niche Gem

LionelDraghi

305mo ago

Finance · ●●● Banger

Test harness that found 250 bugs in open-source matching engines

Found 250 bugs in 247 engines using a byte-identical consensus oracle nobody else built.

Wizardry · Big Brain

roycechocolate

1022h ago

Developer Tools · ● Mid

Assay – Found 250 bugs in LiteLLM, LobeChat via AI code verification

AI finds 250 bugs in LiteLLM, LobeChat, but no demo or accessible entry point.

Big Brain

tywellshn

315mo ago

Developer Tools · ○ Pass

Make every bug perfectly reproducible

Landing page is a Cloudflare bot check—no demo, no code, no way to evaluate claims.

Bold Bet

chaitanyya

1321mo ago

Developer Tools · ●● Solid

I built a bug reporter that opens a GitHub PR to fix the bug

AI PR generation for typos and copy, but bug reporting tools already exist elsewhere.

Solve My Problem · Slick · Ship It

kosbay

104mo ago