Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning

by bluecoconut·Mar 3, 2026·5 points·0 comments

AI Analysis

A 62k-puzzle benchmark shows how much reasoning depth matters, how widely cost per success varies, and a stark gap between US and Chinese models.

Strengths

•Rigorous dataset (62k puzzles, 94 types, 51 models tested) with verifiable intermediate steps, not just prompt-and-judge.
•Agentic mode with feedback loops exposes reasoning scaling: GPT-5.2@xhigh reaches 56% vs 27% single-shot.
•Transparent cost tracking ($0.0003 to $238 per success) and reasoning-depth comparisons reveal practical trade-offs.

Weaknesses

•Only 300 puzzles tested per model; full 62k dataset evaluation would strengthen claims about model capabilities.
•Dataset limited to constraint-satisfaction puzzles; unclear if insights generalize to other reasoning domains.

Post Description

I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, nonori, slitherlink, etc.). For every puzzle variety, the benchmark can also check a partial solution for rule breaks at any intermediate step.
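
To make the idea of an intermediate rule check concrete, here is a minimal sketch for one puzzle type, a standard 9x9 sudoku. It is an illustration only and not the benchmark's actual checker; the function name `find_rule_breaks` and the grid encoding (0 for an empty cell) are assumptions made for this example.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 9x9 grid; 0 marks an empty cell (assumed encoding)


def find_rule_breaks(grid: Grid) -> List[Tuple[str, int, int]]:
    """Return (unit_kind, unit_index, digit) for each repeated digit.

    Works on partially filled grids, so it can run after any move,
    not only on a finished solution.
    """
    units = []
    for i in range(9):
        units.append(("row", i, [grid[i][c] for c in range(9)]))
        units.append(("col", i, [grid[r][i] for r in range(9)]))
        top, left = 3 * (i // 3), 3 * (i % 3)
        box = [grid[top + r][left + c] for r in range(3) for c in range(3)]
        units.append(("box", i, box))

    breaks = []
    for kind, index, cells in units:
        seen = set()
        for digit in cells:
            if digit == 0:
                continue  # empty cells cannot break a rule
            if digit in seen:
                breaks.append((kind, index, digit))
            seen.add(digit)
    return breaks
```

An empty list means the partial grid is still consistent with the rules; it does not mean the grid can be completed.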

I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
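
The agentic mode can be pictured as a propose-check-retry loop. The sketch below is a hypothetical harness, not the one used for the benchmark: `run_agentic`, `propose_move`, `check`, and `is_complete` are invented names, and a toy solver on a three-cell "all values distinct" puzzle stands in for a real model call.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # 0 marks an empty cell (assumed encoding)
Move = Tuple[int, int]   # (cell index, value)


def run_agentic(
    initial: State,
    propose_move: Callable[[State, List[str]], Move],
    check: Callable[[State], List[str]],
    is_complete: Callable[[State], bool],
    max_turns: int = 50,
) -> Tuple[Optional[State], int]:
    """Iterate with verifier feedback; return (solution or None, turns used)."""
    state = initial
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        index, value = propose_move(state, feedback)
        candidate = state[:index] + (value,) + state[index + 1:]
        feedback = check(candidate)
        if feedback:
            continue  # move rejected; the solver sees the feedback next turn
        state = candidate
        if is_complete(state):
            return state, turn
    return None, max_turns


# Toy puzzle: fill three cells so that all values are distinct.
def check_distinct(state: State) -> List[str]:
    filled = [v for v in state if v != 0]
    return ["duplicate value"] if len(filled) != len(set(filled)) else []


def all_filled(state: State) -> bool:
    return 0 not in state


def make_naive_solver() -> Callable[[State, List[str]], Move]:
    """Stand-in for a model: tries 1, 2, 3, ... on the first empty cell."""
    last_tried = {}

    def propose(state: State, feedback: List[str]) -> Move:
        index = state.index(0)
        last_tried[index] = last_tried.get(index, 0) + 1
        return index, last_tried[index]

    return propose


result = run_agentic((0, 0, 0), make_naive_solver(), check_distinct, all_filled)
```

Single-shot mode corresponds to skipping the loop entirely: the model emits a full solution once and the checker is run a single time on the result.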

Some results:

- The best model (GPT-5.2@xhigh) solves 56%. Roughly half of the puzzles are not solved by any model.

- Agentic solves average 29 turns. The longest attempt took ~1,200 turns over 14 hours.

- Cost per success varies wildly: the cheapest is $0.00033 (Grok 4.1 Fast Reasoning) and the most expensive is $238.16 (Claude Sonnet 4.6 with 1M context).

- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, up to the point where @xhigh runs hit repeated infrastructure failures.

- Stark difference between US closed models (3 at >33%) and Chinese open models (top: 6%)
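
For readers who want to reproduce a figure like cost per success from their own runs, one plausible definition is total spend across all attempts divided by the number of solved puzzles. The post does not spell out its exact formula, so treat the definition and the `cost_per_success` helper below as assumptions.

```python
from typing import Iterable, Optional, Tuple


def cost_per_success(attempts: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Total cost over all attempts divided by the number of solves.

    Each attempt is (cost_in_dollars, solved). Failed attempts still add
    to the total, which is why weak models can look very expensive.
    Returns None when nothing was solved.
    """
    total = 0.0
    solved = 0
    for cost, ok in attempts:
        total += cost
        solved += 1 if ok else 0
    return total / solved if solved else None


# Hypothetical run: two solves and one failure, $3.00 spent in total.
example = cost_per_success([(0.5, True), (1.5, False), (1.0, True)])
```

Under this definition a model with a low solve rate pays for every failed attempt, so its cost per success can be far higher than its cost per attempt.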

I made the website to show off the dataset. You can play every puzzle and even replay every AI agent solve step by step (it's fun to watch how an agent gets to a solution).

Also here's the paper: https://arxiv.org/abs/2603.02119

I didn't test human ability to solve these, but the puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.