GitHub Repository

A proofreading benchmark for LLMs

5 stars · TypeScript

I benchmarked how good LLMs are at proofreading English

by artursapek · Apr 25, 2026 · 3 points · 2 comments

AI Analysis

●● Solid · Solve My Problem · Ship It

Agent-loop proofreading evals for cases where HELM and LMSys are too generic.

Strengths
  • Tool-calling agent loop mimics real editing workflows better than static prompts.
  • Public viewer with 1600+ samples makes verifying claims transparent and easy.
  • Supports OpenAI-compatible endpoints for benchmarking self-hosted or internal models.
Weaknesses
  • Running the full benchmark costs $550, limiting accessibility for individual developers.
  • Proofreading is a narrow slice compared to broader reasoning or coding benchmarks.
Target Audience

AI engineers building editing tools or evaluating model performance

Similar To

HELM · LMSys Arena · EleutherAI LM Evaluation Harness
