ErrataBench - A Proofreading Benchmark for LLMs
51 models, 1,613 runs, $558 spent: finally, a proofreading benchmark with real numbers.
A proofreading benchmark for LLMs
Agent-loop proofreading evals for cases where HELM and LMSys are too generic.
AI engineers building editing tools or evaluating model performance
HELM · LMSys Arena · EleutherAI LM Evaluation Harness
First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.
Benchmarks OpenCode models locally, but lacks preloaded datasets and only works with configured OpenAI-compatible APIs.
Cuts token costs 70% with receipts proving no accuracy drop on hard evals.
Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.
Finally separates JSON validity from actual value hallucination in LLM outputs.