ErrataBench - A Proofreading Benchmark for LLMs

by artursapek · Apr 7, 2026 · 3 points · 0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

51 models, 1613 runs, $558 spent: finally, a proofreading benchmark with real numbers.

Strengths
  • Distinguishes omissions from bad fixes, giving actionable failure mode breakdowns.
  • Five days of runtime with transparent methodology and cost tracking throughout.
  • Rankings include efficiency metrics, not just raw accuracy percentages.
Weaknesses
  • Narrow scope limits utility: proofreading is just one of many LLM use cases.
  • LLM benchmarks are common; LMSys and HELM already dominate the space.
Category
Target Audience

AI researchers, developers selecting LLMs for text tasks

Similar To

LMSys Chatbot Arena · HELM · LiveBench
