FretBench – I tested 14 LLMs on reading guitar tabs. Most failed

Author: jmcapra

by jmcapra·Mar 9, 2026·1 point·0 comments


AI Analysis

●● Solid · Big Brain · Niche Gem

Clever benchmark showing how poorly many LLMs read ASCII guitar tabs, possibly because of tokenization, but the domain is narrow.

Strengths

•Methodically constructed benchmark (182 test cases, 4 tunings, 14 models): two open-weight Qwen models scored 83.5%, while most flagship models scored below 50%.
•Genuine insight: the hypothesis that tokenization of ASCII art may explain the performance gap between models is not an obvious one.
•Fully reproducible: open-source code, clear prompts, quantified results via OpenRouter.

Weaknesses

•Extremely narrow domain: guitar tabs are a specialized use case with minimal practical impact.
•No solution offered; purely diagnostic. Doesn't help anyone actually read tabs or build with it.

Post Description

I built FretBench after noticing Gemini was confidently wrong about basic guitar tab questions. Tab is arguably the simplest notation in music: six lines, numbers for frets, read left to right. So I made a benchmark out of it.
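To make "six lines, numbers for frets, read left to right" concrete, here is a minimal sketch of a tab reader. It is not FretBench's code: the function names (`note_at`, `read_tab`), the sample tab, and the assumption that each line looks like `label|frets|` are all illustrative. It assumes standard tuning unless another tuning is passed in.

```python
# Chromatic scale using sharps only (an illustrative simplification).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Standard tuning, listed from the top tab line (high e) to the bottom (low E).
STANDARD_TUNING = ["E", "B", "G", "D", "A", "E"]


def note_at(open_note, fret):
    """Note name produced by fretting a string tuned to open_note."""
    return NOTES[(NOTES.index(open_note) + fret) % 12]


def read_tab(tab, tuning=STANDARD_TUNING):
    """Return (column, string_number, fret, note) tuples in playing order."""
    # Drop the "e|" style label at the start of each line.
    lines = [line.split("|", 1)[1] for line in tab.strip().splitlines()]
    events = []
    for string_idx, line in enumerate(lines):
        col = 0
        while col < len(line):
            if line[col].isdigit():
                start = col
                # Consume multi-digit frets such as 10 or 12.
                while col < len(line) and line[col].isdigit():
                    col += 1
                fret = int(line[start:col])
                note = note_at(tuning[string_idx], fret)
                events.append((start, string_idx + 1, fret, note))
            else:
                col += 1
    # Left to right first; notes in the same column are played together.
    events.sort(key=lambda e: (e[0], e[1]))
    return events


TAB = """
e|-------0---|
B|-----1-----|
G|---0-------|
D|-2---------|
A|-3---------|
E|-----------|
"""

for column, string_number, fret, note in read_tab(TAB):
    print(f"col {column}: string {string_number}, fret {fret} -> {note}")
```

The point of the sketch is that timing is encoded purely by horizontal character position, so a reader has to line up columns across all six rows. Passing a different `tuning` list changes the note names without changing the parsing.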

182 test cases, 4 tunings, 14 models via OpenRouter. Two open-weight Qwen models from Alibaba crushed everything else (83.5%), while most "flagship" models scored below 50%. MiniMax M2.5 scored worse than random guessing.

Everything is open source: https://github.com/jmcapra/FretBench

I'm curious whether the performance gap is related to tokenisation of ASCII art — if anyone has insights on how different tokenisers handle grid-structured text, I'd love to hear it.
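One way to picture the question above is with a toy tokeniser. The sketch below is purely hypothetical: `toy_tokenize` is not any real model's tokeniser, and the rule it uses (merge runs of dashes into chunks of up to four characters) is only an assumption standing in for the kind of merges a real vocabulary might contain. It shows that the same character column can land at different token positions on different tab lines.

```python
def toy_tokenize(line, max_run=4):
    """Hypothetical tokeniser: merges runs of '-' into chunks of up to max_run."""
    tokens = []
    i = 0
    while i < len(line):
        if line[i] == "-":
            j = i
            while j < len(line) and line[j] == "-" and j - i < max_run:
                j += 1
            tokens.append(line[i:j])
            i = j
        else:
            # Every other character becomes its own token.
            tokens.append(line[i])
            i += 1
    return tokens


def token_index_of_column(tokens, column):
    """Index of the token that covers the given character column."""
    pos = 0
    for idx, tok in enumerate(tokens):
        if pos <= column < pos + len(tok):
            return idx
        pos += len(tok)
    raise IndexError(column)


rows = ["e|-------0---|", "B|-----1-----|"]
column = 9  # the character column holding the '0' on the first row

for row in rows:
    tokens = toy_tokenize(row)
    print(tokens, "-> column", column, "is token", token_index_of_column(tokens, column))
```

Under this toy rule the two rows split into different numbers of tokens, and character column 9 falls in a different token position on each row. Whether real tokenisers behave this way, and whether it accounts for the benchmark results, is exactly the open question; running the rows through each model's actual tokeniser would be the way to check.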