FretBench – I tested 14 LLMs on reading guitar tabs. Most failed

by jmcapra · Mar 9, 2026 · 1 point · 0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Clever benchmark that tests how well LLMs read ASCII guitar tabs and points to tokenization as a possible weakness, but the domain is narrow.

Strengths
  • Methodically constructed benchmark (182 test cases, 4 tunings, 14 models): two open-weight Qwen models reached 83.5%, while most flagship models scored below 50%.
  • Genuine insight: the author's hypothesis that ASCII-art tokenization drives the performance gap is not obvious, though the post raises it as an open question rather than a tested result.
  • Fully reproducible: open-source code, clear prompts, quantified results via OpenRouter.
Weaknesses
  • Extremely narrow domain: guitar tabs are a specialized use case with minimal practical impact.
  • No solution offered; purely diagnostic. Doesn't help anyone actually read tabs or build with it.
Category
Target Audience

AI researchers, LLM practitioners, musicians interested in AI capabilities and limitations.

Similar To

MMLU · ARC · Harness

Post Description

I built FretBench after noticing Gemini was confidently wrong about basic guitar tab questions. Tab is arguably the simplest notation in music: six lines, numbers for frets, read left to right. So I made a benchmark out of it.
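To make "six lines, numbers for frets, read left to right" concrete, here is a minimal sketch of a tab reader. It is not taken from the FretBench repository; the names `note_at`, `read_tab`, and the sample `C_MAJOR` tab are hypothetical, and it assumes standard tuning with each line starting with a string label followed by `|`.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Assumption: standard tuning, listed from the top tab line (high e) to the bottom (low E).
STANDARD_TUNING = ["E", "B", "G", "D", "A", "E"]

# Hypothetical sample: an open C major chord, all notes in the same column.
C_MAJOR = """
e|---0---|
B|---1---|
G|---0---|
D|---2---|
A|---3---|
E|-------|
"""


def note_at(open_note: str, fret: int) -> str:
    """Pitch class reached by moving `fret` semitones up from an open string."""
    return NOTE_NAMES[(NOTE_NAMES.index(open_note) + fret) % 12]


def read_tab(tab: str, tuning=STANDARD_TUNING):
    """Return (column, string_number, note) for every fretted note, left to right."""
    lines = [ln for ln in tab.strip().splitlines() if ln.strip()]
    if len(lines) != len(tuning):
        raise ValueError("expected one tab line per string")
    events = []
    for string_idx, line in enumerate(lines):
        body = line.split("|", 1)[1]  # drop the string label, e.g. "e|"
        col = 0
        while col < len(body):
            if body[col].isdigit():
                end = col
                while end < len(body) and body[end].isdigit():
                    end += 1  # multi-digit frets such as 10 or 12
                fret = int(body[col:end])
                events.append((col, string_idx + 1, note_at(tuning[string_idx], fret)))
                col = end
            else:
                col += 1
    events.sort()  # column first, so the result reads left to right
    return events


for column, string_number, note in read_tab(C_MAJOR):
    print(f"column {column}, string {string_number}: {note}")
```

Passing a different `tuning` list (for example drop D) changes only the open-string lookup, which is presumably why a benchmark would vary tunings: the same digits then map to different notes.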

182 test cases, 4 tunings, 14 models via OpenRouter. Two open-weight Qwen models from Alibaba crushed everything else (83.5%), while most "flagship" models scored below 50%. MiniMax M2.5 scored worse than random guessing.
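As a rough sanity check on these figures, the sketch below asks which raw counts of correct answers would round to a reported percentage. It assumes the score is plain accuracy over all 182 cases, which the post does not state; `correct_counts` is a hypothetical helper, not part of FretBench.

```python
def correct_counts(total: int, reported_pct: float, decimals: int = 1) -> list[int]:
    """All counts k such that k/total, as a percentage, rounds to reported_pct."""
    return [
        k for k in range(total + 1)
        if round(100 * k / total, decimals) == reported_pct
    ]


# Assumption: one point per test case, 182 cases in total.
print(correct_counts(182, 83.5))  # counts consistent with the reported 83.5%
print(correct_counts(182, 50.0))  # the count sitting exactly at 50%
```

Under that assumption, 83.5% corresponds to 152 of 182 correct, and "below 50%" means fewer than 91 correct.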

Everything is open source: https://github.com/jmcapra/FretBench

I'm curious whether the performance gap is related to tokenisation of ASCII art. If anyone has insights on how different tokenisers handle grid-structured text, I'd love to hear them.
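One way to picture why grid-structured text might be hard, independent of any particular tokeniser: a model receives the tab as a one-dimensional string, so characters that are vertically aligned (notes played together) end up a full line length apart. The sketch below only illustrates that flattening; it uses no real tokeniser, and `flat_offsets` and the sample tab are hypothetical.

```python
# Hypothetical sample: an open C major chord, with every fret number in column 5.
TAB = (
    "e|---0---|\n"
    "B|---1---|\n"
    "G|---0---|\n"
    "D|---2---|\n"
    "A|---3---|\n"
    "E|-------|\n"
)


def flat_offsets(tab: str, column: int) -> list[int]:
    """Offsets in the flat string of the character at `column` on each line."""
    offsets = []
    pos = 0
    for line in tab.splitlines(keepends=True):
        if column < len(line.rstrip("\n")):
            offsets.append(pos + column)
        pos += len(line)  # newline included, as in the text a model would see
    return offsets


offsets = flat_offsets(TAB, 5)
print(offsets)                    # where the aligned characters sit in 1D
print([TAB[i] for i in offsets])  # the characters themselves
```

Here the aligned characters are 11 positions apart in the flat string. Whether a real tokeniser makes this worse, for instance by merging runs of dashes into tokens of uneven length, is exactly the open question the post raises, so this should be read as an illustration of the hypothesis and not as evidence for it.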

Similar Projects

AI/ML · ●● Solid

ModelSweep - Open-Source Benchmarking for Local LLMs

Postman for local LLMs with LLM-as-Judge and Elo ratings built in.

Ship It · Niche Gem · Slick
leonickson
202mo ago