ErrataBench - A Proofreading Benchmark for LLMs

Author: artursapek

by artursapek·Apr 7, 2026·3 points·0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

51 models, 1613 runs, $558 spent — finally proofreading benchmarks with real numbers.

Strengths

• Distinguishes omissions from bad fixes, giving actionable failure mode breakdowns.
• Five days of runtime with transparent methodology and cost tracking throughout.
• Rankings include efficiency metrics, not just raw accuracy percentages.
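
To make the omission vs. bad-fix distinction concrete, here is a minimal sketch of how one seeded error might be labeled. It is not ErrataBench's actual scoring code: the `classify` and `breakdown` helpers, the label names, and the sample cases are all hypothetical, and a real harness would likely compare spans inside longer documents rather than whole strings.

```python
from collections import Counter

def classify(original: str, expected: str, output: str) -> str:
    """Label one seeded error (hypothetical scheme, not the benchmark's own)."""
    if output == expected:
        return "fixed"      # the model produced the reference correction
    if output == original:
        return "omission"   # the model left the error untouched
    return "bad_fix"        # the model changed the text, but not to the reference

def breakdown(cases):
    """Count each failure mode across (original, expected, output) triples."""
    return Counter(classify(*case) for case in cases)

# Illustrative, made-up cases: one of each outcome.
cases = [
    ("teh cat", "the cat", "the cat"),
    ("recieve", "receive", "recieve"),
    ("their going", "they're going", "there going"),
]
print(breakdown(cases))
```

Keeping omissions and bad fixes as separate counts shows whether a model tends to miss errors or to introduce wrong corrections, which a single accuracy figure would hide.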

Weaknesses

AI/ML · ●● Solid

Agent loop proofreading evals where HELM and LMSys are too generic.

Solve My Problem · Ship It

artursapek

321mo ago

First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.

Solve My Problem · Niche Gem · Big Brain

yamarldfst

103mo ago

AI/ML · ●●● Banger

Opposite-narrator test catches models agreeing with both sides of the same dispute.

Big Brain · Dark Horse

zone411

303mo ago

AI/ML · ●●● Banger

Cuts token costs 70% with receipts proving no accuracy drop on hard evals.

Zero to One · Solve My Problem

Jbunga

56331mo ago

AI/ML · ●● Solid

One-click LLM benchmarking with real tok/s metrics when llama.cpp requires manual setup.

Ship It · Solve My Problem

JoniMartin

209d ago

AI/ML · ●● Solid

One-command benchmark suite comparing Ollama and XGBoost performance with a shared Streamlit dashboard.

Solve My Problem · Niche Gem

albedan

2029d ago