GitHub Repository

Benchmark harness measuring AI coding tool+workflow performance, not just model capability. 100 tasks, sigmoid scoring, 12 capability dimensions, gap analysis.

10 stars · Python

AWB – Benchmark that tests your AI coding workflow, not just the model

Author: xmpuspus

by xmpuspus · Mar 22, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Big Brain · Zero to One

Tests workflow + tool + model together, not just model capability like SWE-bench.

Strengths

• 80 tasks from real open-source repos with pinned commits
• 7 scoring dimensions, including security, cost, and reliability
• Sigmoid normalization prevents score collapse at boundaries
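
The sigmoid point above can be illustrated with a minimal sketch. This listing does not state AWB's actual scoring formula, so the function names, the `midpoint` and `steepness` parameters, and the sample values below are all assumptions chosen for illustration. The sketch contrasts clipped min-max scaling, where every value past a bound collapses to the same score, with a logistic curve, which keeps distinct raw values distinct.

```python
import math


def clipped_linear(raw, lo, hi):
    """Min-max scaling with clipping: anything outside [lo, hi] collapses to 0.0 or 1.0."""
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))


def sigmoid_score(raw, midpoint, steepness):
    """Logistic scaling: maps a raw value into (0, 1) and stays strictly increasing.

    `midpoint` (the raw value that scores 0.5) and `steepness` are hypothetical
    parameters, not taken from the AWB repository.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (raw - midpoint)))


# Hypothetical raw metric values; the last two lie beyond the upper bound of 100.
for raw in (50, 100, 120, 150):
    linear = clipped_linear(raw, lo=0, hi=100)
    smooth = sigmoid_score(raw, midpoint=50, steepness=0.05)
    print(f"raw={raw:>3}  clipped={linear:.3f}  sigmoid={smooth:.4f}")
```

With clipping, raw values of 120 and 150 become indistinguishable, while the logistic curve still ranks them in order. The trade-off is that the sigmoid never reaches exactly 0 or 1.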

Weaknesses

• Zero stars means no community validation yet
• Workflow benchmarking category will attract competitors quickly

Similar Projects

Developer Tools · ●●● Banger

Tracecore: Benchmark AI Agents on Deterministic Coding Tasks

Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.

Solve My Problem · Wizardry · Niche Gem

extra_cookin

103mo ago

AI/ML · ●●● Banger

OpenCastor Agent Harness Evaluator Leaderboard

263k config search space benchmarked across robot fleets—nothing like this exists for robotics AI.

Zero to One · Big Brain · Niche Gem

craigm26

312mo ago

AI/ML · ●●● Banger

LLM Sycophancy Benchmark: Opposite-Narrator Contradictions

Opposite-narrator test catches models agreeing with both sides of same dispute.

Big Brain · Dark Horse

zone411

303mo ago

Security · ●● Solid

AgentToolBench-Code – security benchmark for AI coding agents

Expands corpus to 16 CVE-anchored scenarios to break model ties.

Big Brain · Niche Gem

allenwu06

1022d ago

AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain

ryan4rtmx

2019d ago

AI/ML · ●● Solid

jj-benchmark – Evaluating AI agents on Jujutsu version control

AI benchmarking for jj CLI when LMSys and HuggingFace already dominate the space.

Niche Gem · Big Brain

wsxiaoys

523mo ago