I turned ARC-AGI-3 into a daily browser game
Wordle-style daily format makes ARC-AGI puzzles actually fun to play.
This repository reproduces the results from the blog post "Agentic coding improves ARC AGI 2 performance across models".
The authors report a surprisingly large effect: putting models into an interleaved-thinking regime with a stateful IPython REPL yields large score boosts (>4x on GPT-OSS-120B, and double-digit gains even for frontier models). The repo is more than a write-up: it includes pragmatic engineering (a patched vLLM image, ipybox/daytona integration, solver configs) so you can reproduce the results, but expect nontrivial infrastructure setup and API key requirements.
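To make the "interleaved thinking with a stateful REPL" idea concrete, here is a minimal sketch of such a loop. It is not the repository's implementation: plain `exec` with a persistent namespace stands in for the IPython kernel, and `StatefulRepl`, `solve`, and `scripted_policy` are hypothetical names introduced only for illustration. The scripted policy replaces what would be a real model call.

```python
import contextlib
import io
import traceback


class StatefulRepl:
    """Tiny stand-in for an IPython kernel: variables persist across calls."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute code in the shared namespace and return captured output."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Errors are returned as text so the model can react to them.
            buffer.write(traceback.format_exc(limit=0))
        return buffer.getvalue()


def solve(policy, repl, max_turns=8):
    """Alternate between a model step and code execution until an answer appears."""
    transcript = []  # list of (code, observation) pairs seen so far
    for _ in range(max_turns):
        step = policy(transcript)  # hypothetical: {"code": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"], transcript
        observation = repl.run(step["code"])
        transcript.append((step["code"], observation))
    return None, transcript


def scripted_policy(transcript):
    """Stub in place of an LLM call: replays a fixed sequence of steps."""
    steps = [
        {"code": "grid = [[1, 2], [3, 4]]\nprint(len(grid))"},
        # This step relies on `grid` still existing from the previous turn.
        {"code": "flipped = [row[::-1] for row in grid]\nprint(flipped)"},
    ]
    if len(transcript) < len(steps):
        return steps[len(transcript)]
    return {"answer": transcript[-1][1].strip()}


answer, transcript = solve(scripted_policy, StatefulRepl())
print(answer)
```

The point of the sketch is the second step: it reuses `grid` defined in the first turn, which only works because the execution state survives between turns. A stateless sandbox would force the model to resend all prior code on every turn.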
ML researchers and engineers, inference systems developers, ARC Prize competitors, and anyone experimenting with agentic prompting and code-execution loops.
To bootstrap this process, we needed successful solution traces from an open-weight reasoning model for cold-start supervised fine-tuning. That requirement led us to investigate GPT-OSS-120B. While doing so, we noticed something unexpected: simply placing the model into the interleaved thinking regime produced large and consistent score improvements on ARC AGI 2 tasks. We were seeing scores that we didn't think were possible for a medium-sized OSS model.
This observation ultimately shifted the focus of our work: we wanted to find out how universally it applies while staying within our resource constraints. We concluded that it applies quite generally, with double-digit gains in frontier models too.
I have previously read debates about whether ARC AGI 2 is primarily a reasoning benchmark or a visual benchmark. I guess we can now add "agentic benchmark" to the mix as well!
Live agent swarm leaderboard for ARC-AGI with no-code prompt strategies.
Real Gerbil Scheme REPL in browser with persistent state between expressions.
Philosophical thought experiment running on GitHub Actions—not a functional product or system.
Transparent Tor routing via nftables, but proxychains already solves this.
155 decision frameworks in CSV, BM25-ranked for your problem—structured thinking at prompt time.