FretBench – I tested 14 LLMs on reading guitar tabs. Most failed
Clever benchmark exposing LLM tokenization weakness on ASCII art, but narrow domain.

Proves mesh-to-BREP failure modes with IRT-calibrated scores across 28-task pilot suite.
CAD software developers, AI researchers, mechanical engineers
HumanEval · BigCodeBench · SWE-bench
Editable BREP output beats mesh generators—download the code and keep building.
Ambitious curriculum bridging basic arithmetic to quantum mechanics without skipping steps.
Another browser CAD, but v0.0.1 lacks features to compete with Onshape.
Opposite-narrator test catches models agreeing with both sides of same dispute.
Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.