Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning

by bluecoconut · Mar 3, 2026 · 5 points · 0 comments

AI Analysis

●●● Banger · Big Brain · Crowd Pleaser · Solve My Problem

62k puzzle benchmark reveals reasoning depth, cost variance, and stark US vs China model gaps.

Strengths
  • Rigorous dataset (62k puzzles, 94 types, 51 models tested) with verifiable intermediate steps, not just prompt-and-judge.
  • Agentic mode with feedback loops exposes reasoning scaling: GPT-5.2@xhigh reaches 56% vs 27% single-shot.
  • Transparent cost tracking ($0.0003 to $238 per success) and reasoning-depth comparison reveals practical trade-offs.
Weaknesses
  • Only 300 puzzles tested per model; full 62k dataset evaluation would strengthen claims about model capabilities.
  • Dataset limited to constraint-satisfaction puzzles; unclear if insights generalize to other reasoning domains.
Category
Target Audience

AI researchers, LLM developers, benchmark enthusiasts

Similar To

MMLU · Big-Bench · ARC

Post Description

I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, nonori, slitherlink, etc.). The benchmark also supports intermediate checks for rule breaks, for every variety and at any step.
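
As a rough illustration of what an intermediate rule-break check can look like, here is a minimal sketch for sudoku only. The function name `find_rule_breaks` and the grid encoding (9x9 lists, `0` for an empty cell) are assumptions made for this example, not the benchmark's actual API.

```python
from collections import Counter


def find_rule_breaks(grid):
    """Return (unit_name, digit) pairs for digits duplicated in a partial grid.

    Hypothetical helper: `grid` is a 9x9 list of lists, with 0 for empty cells.
    A partially filled grid with no duplicates yields an empty list.
    """
    units = []
    for i in range(9):
        units.append((f"row {i}", [grid[i][c] for c in range(9)]))
        units.append((f"col {i}", [grid[r][i] for r in range(9)]))
    for br in range(3):
        for bc in range(3):
            cells = [grid[br * 3 + r][bc * 3 + c]
                     for r in range(3) for c in range(3)]
            units.append((f"box {br},{bc}", cells))

    breaks = []
    for name, cells in units:
        # Empty cells are ignored, so the check works at any step.
        counts = Counter(v for v in cells if v != 0)
        for digit, n in counts.items():
            if n > 1:
                breaks.append((name, digit))
    return breaks
```

Because empty cells are skipped, the same check can run after every move rather than only on a finished grid.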

I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
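
The agentic mode can be pictured as a propose-and-verify loop. The sketch below is an assumption about the general shape of such a loop, not the benchmark's harness: `run_agentic` and all of its callback parameters are hypothetical names, and a real run would call a model where `propose_move` is.

```python
def run_agentic(initial_state, propose_move, apply_move,
                find_violations, is_solved, max_turns=50):
    """Iterate with verifier feedback until solved or out of turns.

    All callbacks are hypothetical stand-ins:
      propose_move(state, feedback) -> move   (the model's turn)
      apply_move(state, move)       -> new state
      find_violations(state)        -> list of rule breaks (empty if none)
      is_solved(state)              -> bool
    Returns (final_state, turns_used), or (None, max_turns) on failure.
    """
    state = initial_state
    feedback = []
    for turn in range(1, max_turns + 1):
        move = propose_move(state, feedback)
        candidate = apply_move(state, move)
        violations = find_violations(candidate)
        if violations:
            # Reject the move and tell the proposer why.
            feedback = violations
            continue
        state, feedback = candidate, []
        if is_solved(state):
            return state, turn
    return None, max_turns
```

Single-shot mode corresponds to skipping the loop entirely: one proposal for the full solution, checked once.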

Some results:

- The best model (GPT 5.2@xhigh) solves 56%; roughly half the puzzles are unsolved by any model.

- Agentic solves average 29 turns. The longest attempt took ~1,200 turns over 14 hours.

- Cost per success varies wildly (cheapest: $0.00033 — Grok 4.1 Fast Reasoning, most expensive: $238.16 — Claude Sonnet 4.6 (1M context))

- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, up to the point where @xhigh runs hit repeated infrastructure failures

- Stark difference between US closed models (3 at >33%) and Chinese open models (top: 6%)
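
One plausible way to compute a cost-per-success figure like the ones above is total spend across all attempts divided by the number of solved puzzles. This definition is an assumption (the post does not spell it out), and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(attempts):
    """Total dollars spent divided by number of solved puzzles.

    `attempts` is a list of (cost_in_dollars, solved) pairs. Failed attempts
    still count toward the total spend. Returns None if nothing was solved.
    """
    total = sum(cost for cost, _ in attempts)
    wins = sum(1 for _, solved in attempts if solved)
    return total / wins if wins else None
```

Under this definition, a model that is cheap per attempt but rarely succeeds can still end up with a high cost per success.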

I made the website to show off the dataset: you can play every puzzle, and even replay every AI agent solve step by step (it's fun to watch how it gets to solutions).

Also here's the paper: https://arxiv.org/abs/2603.02119

I didn't test human ability to solve them, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.

Similar Projects

AI/ML · ●●● Banger

Legal RAG Bench

Legal RAG benchmark revealing embedding quality > LLM choice by 19-point margin.

Big Brain · Niche Gem · Solve My Problem
beowa
413mo ago