I benchmarked how good LLMs are at proofreading English
Agent-loop proofreading evals for cases where HELM and LMSys are too generic.

51 models, 1613 runs, $558 spent: proofreading benchmarks with real numbers at last.
AI researchers, developers selecting LLMs for text tasks
LMSys Chatbot Arena · HELM · LiveBench
First linter + benchmark for MCP servers; catches vague schemas before LLMs pick wrong tools.
Opposite-narrator test catches models agreeing with both sides of the same dispute.
Cuts token costs 70% with receipts proving no accuracy drop on hard evals.
One-click LLM benchmarking with real tok/s metrics when llama.cpp requires manual setup.
One-command benchmark suite comparing Ollama and XGBoost performance with a shared Streamlit dashboard.