GitHub Repository

A proofreading benchmark for LLMs

5 stars · TypeScript

I benchmarked how good LLMs are at proofreading English

by artursapek·Apr 25, 2026·3 points·2 comments

AI Analysis

●● Solid · Solve My Problem · Ship It

Agent-loop proofreading evals for cases where HELM and LMSys are too generic.

Strengths

• Tool-calling agent loop mimics real editing workflows better than static prompts.
• Public viewer with 1600+ samples makes verifying claims transparent and easy.
• Supports OpenAI-compatible endpoints for benchmarking self-hosted or internal models.
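To make the "tool-calling agent loop" idea concrete, here is a minimal sketch of what such a loop could look like. Everything in it is an assumption for illustration: the `ToolCall` shape, the `proofread` function, and the `fakeModel` stand-in are hypothetical names, not taken from the benchmark's code, and a real run would replace `fakeModel` with a call to an OpenAI-compatible endpoint.

```typescript
// Hypothetical sketch: none of these names come from the benchmark itself.
type ToolCall =
  | { tool: "fix"; find: string; replace: string }
  | { tool: "done" };

// A "model" is anything that inspects the current text and picks the next tool call.
type Model = (text: string) => ToolCall;

// Run the loop: apply each requested fix until the model says "done"
// or the step budget runs out.
function proofread(text: string, model: Model, maxSteps = 10): string {
  let current = text;
  for (let step = 0; step < maxSteps; step++) {
    const call = model(current);
    if (call.tool === "done") break;
    // Replace only the first occurrence; ignore the call if the target is absent.
    if (current.includes(call.find)) {
      current = current.replace(call.find, call.replace);
    }
  }
  return current;
}

// Stand-in model with a fixed typo list (a real run would query an LLM instead).
const typos: Array<[string, string]> = [
  ["teh", "the"],
  ["recieve", "receive"],
];
const fakeModel: Model = (text) => {
  for (const [bad, good] of typos) {
    if (text.includes(bad)) return { tool: "fix", find: bad, replace: good };
  }
  return { tool: "done" };
};

const result = proofread("I recieve teh mail.", fakeModel);
console.log(result);
```

The step budget (`maxSteps`) is one plausible way to keep a misbehaving model from looping forever; whether the actual benchmark caps steps this way is not stated on this page.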

Weaknesses

• Running the full benchmark costs roughly $550, limiting accessibility for individual developers.
• Proofreading is a narrow slice compared to broader reasoning or coding benchmarks.

Category

Target Audience

AI engineers building editing tools or evaluating model performance

Similar To

HELM · LMSys Arena · EleutherAI LM Evaluation Harness

Similar Projects

AI/ML · ●● Solid

ErrataBench - A Proofreading Benchmark for LLMs

51 models, 1613 runs, $558 spent — finally proofreading benchmarks with real numbers.

Niche Gem · Big Brain

artursapek

301mo ago

Developer Tools · ●●● Banger

AgentDX – Open-source linter and LLM benchmark for MCP servers

First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.

Solve My Problem · Niche Gem · Big Brain

yamarldfst

103mo ago

Developer Tools · ● Mid

OpenCode Benchmark Dashboard

Benchmarks OpenCode models locally, but lacks preloaded datasets and only works with configured OpenAI-compatible APIs.

Niche Gem

grigio

102mo ago

AI/ML · ●●● Banger

Reducing LLM input tokens by 70%

Cuts token costs 70% with receipts proving no accuracy drop on hard evals.

Zero to One · Solve My Problem

Jbunga

563322d ago

AI/ML · ● Mid

100% LLM accuracy–no fine-tuning, JSON only

Ancient Rome Q&A benchmark shows 81pp accuracy lift, but lacks adversarial defense evidence.

Big Brain

MysticBirdie

223mo ago

AI/ML · ●●● Banger

A new benchmark for testing LLMs for deterministic outputs

Finally separates JSON validity from actual value hallucination in LLM outputs.

Big Brain · Solve My Problem

khurdula

60301mo ago