Benchmarking how AI models write vulnerable code under pressure

by kitdobyns·Apr 22, 2026·3 points·2 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Dark Horse

Tests AI coding assistants against social engineering, not just static code quality.

Strengths

• Persona-based prompts simulate real-world pressure like deadlines and junior devs.
• Semgrep integration adds deterministic security scanning to LLM judge scores.
• Breaks down vulnerabilities by CWE type like SQLi and hardcoded credentials.
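
The per-CWE breakdown mentioned above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a Semgrep-style JSON report in which each finding carries its CWE label under `extra.metadata.cwe`, and the rule ids and the `sample_report` data are hypothetical.

```python
from collections import Counter


def count_by_cwe(report: dict) -> Counter:
    """Tally findings per CWE id from a Semgrep-style JSON report.

    Assumption: each finding stores its CWE label(s) under
    extra.metadata.cwe, either as one string or as a list of strings.
    """
    counts = Counter()
    for finding in report.get("results", []):
        cwe = finding.get("extra", {}).get("metadata", {}).get("cwe", [])
        # Normalise a single string into a one-element list.
        labels = [cwe] if isinstance(cwe, str) else cwe
        for label in labels:
            # "CWE-89: SQL Injection" -> "CWE-89"
            counts[label.split(":")[0].strip()] += 1
    return counts


# Hypothetical report data, made up for illustration only.
sample_report = {
    "results": [
        {"check_id": "hypothetical.sqli",
         "extra": {"metadata": {"cwe": ["CWE-89: SQL Injection"]}}},
        {"check_id": "hypothetical.hardcoded-secret",
         "extra": {"metadata": {"cwe": "CWE-798: Use of Hard-coded Credentials"}}},
        {"check_id": "hypothetical.sqli",
         "extra": {"metadata": {"cwe": ["CWE-89: SQL Injection"]}}},
    ]
}

by_cwe = count_by_cwe(sample_report)
for cwe_id, n in by_cwe.most_common():
    print(cwe_id, n)
```

Counting scanner findings this way stays deterministic, so it can be reported next to an LLM judge score without inheriting the judge's bias.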

Weaknesses

• Small model set (5) limits usefulness as new versions release weekly.
• LLM judges introduce potential bias in scoring advisory quality and resistance.

Category

Target Audience

Security engineers, AI platform leads, CTOs evaluating coding assistants

Similar To

LMSYS Chatbot Arena · Hugging Face Open LLM Leaderboard · SecureBench

Similar Projects

Developer Tools · ●●● Banger

Cheddar-bench – unsupervised benchmark for coding agents

Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.

Big Brain · Wizardry · Ship It

przadka

903mo ago

AI/ML · ●●● Banger

AWB – Benchmark that tests your AI coding workflow, not just the model

Tests workflow + tool + model together, not just model capability like SWE-bench.

Big Brain · Zero to One

xmpuspus

102mo ago

Security · ●● Solid

AgentToolBench-Code – security benchmark for AI coding agents

Expands corpus to 16 CVE-anchored scenarios to break model ties.

Big Brain · Niche Gem

allenwu06

108d ago

AI/ML · ●● Solid

Speechos – Benchmark 25 speech AI models locally, no cloud needed

Side-by-side model comparison eliminates guessing which speech engine fits your hardware.

Dark Horse · Solve My Problem

hamuf

113mo ago

AI/ML · ●● Solid

LLM Debate Benchmark

Side-swapped debate matchups expose model weaknesses standard benchmarks miss.

Big Brain · Dark Horse

zone411

932mo ago

Developer Tools · ● Mid

Skylos – A Python dead code finder benchmarked against 9 libraries

Dead code finder benchmarked across FastAPI, Pydantic, and Flask, but Vulture and Bandit already solve this.

Solve My Problem

duriantaco

312mo ago