GitHub Repository

This repository allows reproduction of the blog post "Agentic coding improves ARC AGI 2 performance across models"

5 stars · Python

Solving ARC AGI 2 with interleaved thinking and stateful IPython REPL

by steinsgate · Feb 17, 2026 · 2 points · 0 comments

AI Analysis

Solid · Wizardry · Niche Gem
The Take

The authors report a surprisingly large effect: putting models into an interleaved-thinking regime with a stateful IPython REPL yields large score gains (more than 4x on GPT-OSS-120B, and double-digit gains for frontier models as well). The repo is more than a companion to a write-up: it includes pragmatic engineering (a patched vLLM image, ipybox/daytona integration, solver configs) so you can reproduce the results, but expect nontrivial infrastructure setup and API key requirements.
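To make the "interleaved thinking with a stateful REPL" idea concrete, here is a minimal sketch using only the Python standard library. It is not the repository's implementation: `StatefulRepl`, `solve`, and the scripted `steps` are hypothetical names, and plain `exec` stands in for the IPython/ipybox sandbox. The point it illustrates is that variables defined in one code cell stay available in the next, so each reasoning step can build on earlier results.

```python
import contextlib
import io
import traceback


class StatefulRepl:
    """Toy stand-in for a stateful IPython session: variables persist across calls."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)
            except Exception:
                # Return the error text so the next reasoning step can react to it.
                buffer.write(traceback.format_exc(limit=0))
        return buffer.getvalue()


def solve(steps, repl):
    """Alternate between a thinking note and a code cell, recording each output."""
    transcript = []
    for thought, code in steps:
        output = repl.run(code)
        transcript.append((thought, code, output))
    return transcript


# Hypothetical scripted turns; a real solver would ask an LLM for each step
# and feed the previous output back into its context.
steps = [
    ("Load the grid and inspect its size.",
     "grid = [[0, 1], [1, 0]]\nprint(len(grid), len(grid[0]))"),
    ("Reuse `grid` from the previous cell to test a flip hypothesis.",
     "flipped = [row[::-1] for row in grid]\nprint(flipped)"),
]
repl = StatefulRepl()
transcript = solve(steps, repl)
for thought, code, output in transcript:
    print(thought, "->", output.strip())
```

The second cell works only because `grid` survived from the first one. That persistence is what distinguishes a stateful REPL from running each snippet in a fresh process.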

Target Audience

ML researchers/engineers, inference systems developers, ARC Prize competitors, folks experimenting with agentic prompting and code-execution loops

Post Description

My friends and I started this project in the summer of 2025 with the initial goal of participating in the ARC Prize Kaggle competition. Early on, we were exploring agentic coding with frontier reasoning models and found that models like o3 and o4-mini could generate high-quality synthetic ARC-style puzzles. Our plan was to use these synthetic puzzles to train a smaller model via agentic reinforcement learning (RLVR with interleaved thinking).

To bootstrap this process, we needed successful solution traces from an open-weight reasoning model for cold-start supervised fine-tuning. That requirement led us to investigate GPT-OSS-120B. While doing so, we noticed something unexpected: simply placing the model into the interleaved thinking regime produced large and consistent score improvements on ARC AGI 2 tasks. We were seeing scores that we didn't think were possible for a medium-sized OSS model.

This observation ultimately shifted the focus of our work: we wanted to find out how broadly the effect holds across models while staying within our resource constraints. We concluded that it applies quite generally, with double-digit gains in frontier models too.

I have previously read debates about whether ARC AGI 2 is primarily a reasoning benchmark or a visual benchmark. I guess we can now add "agentic benchmark" to the mix as well!
