Running an LLM Inside Scratch

Author: broyojo

by broyojo · Feb 12, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Wizardry · Rabbit Hole · Bold Bet

LLM inference inside Scratch at 1 token per 10 seconds: absurd, intentional, and it works.

Strengths

• A genuine compile-to-Scratch pipeline: C inference code is compiled into valid Scratch blocks rather than wrapped around an external API.
• Clever memory packing: weights are quantized to Q8_0 and the entire model is mapped into a single Scratch list with fixed addresses.
• A working live demo on MIT Scratch proves feasibility, with tokens streamed into a sprite's speech bubble as they are generated.
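
The memory-packing idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual layout: the tensor names, the `pack_tensors` and `load` helpers, and the back-to-back ordering are all hypothetical. The one detail taken from Scratch itself is that list indices start at 1.

```python
def pack_tensors(tensors):
    """Lay out named tensors back to back in one flat list.

    Returns (memory, base), where base[name] is the fixed 1-based
    address of the tensor's first element. Names are hypothetical.
    """
    memory = []
    base = {}
    for name, data in tensors.items():
        base[name] = len(memory) + 1  # Scratch lists are 1-indexed
        memory.extend(data)
    return memory, base


def load(memory, address):
    """Read one cell by 1-based address, like Scratch's 'item (n) of list'."""
    return memory[address - 1]


# Illustrative tensors only; real weights would be quantized values.
memory, base = pack_tensors({"embed": [1, 2, 3], "wq": [4, 5]})
second_wq = load(memory, base["wq"] + 1)
```

Because every base address is known before the program runs, compiled code can refer to a weight as a constant base plus an offset, with no lookup at run time.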

Weaknesses

• At 1 token every 10 seconds, it is a tech demonstration rather than a usable tool.
• Extremely narrow audience: following the project requires understanding the Scratch VM, llvm2scratch, and llama2.c all at once.

Post Description

This runs the smallest llama2.c checkpoint (stories260K) inside Scratch/TurboWarp by compiling C inference code into Scratch blocks using llvm2scratch. The model is quantized to Q8_0 and packed into Scratch lists. If everything works, the sprite streams "Once upon a time..." token-by-token into its speech bubble.
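
For readers unfamiliar with Q8_0, here is a rough sketch of the general idea: symmetric 8-bit quantization with one floating-point scale per group of values. The group size of 32 and the function names are assumptions made for illustration; the project's own quantization code may differ in detail.

```python
def quantize_q8_0(values, group_size=32):
    """Symmetric int8 quantization with one float scale per group.

    group_size=32 is an assumption for this sketch.
    """
    assert len(values) % group_size == 0
    quants, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 127.0  # largest magnitude maps to +/-127
        scales.append(scale)
        for v in group:
            quants.append(round(v / scale) if scale > 0 else 0)
    return quants, scales


def dequantize_q8_0(quants, scales, group_size=32):
    """Recover approximate floats: each int is multiplied by its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quants)]


weights = [i / 10 - 1.6 for i in range(64)]
quants, scales = quantize_q8_0(weights)
restored = dequantize_q8_0(quants, scales)
```

Each restored value differs from the original by at most half of its group's scale, which is the price paid for storing small integers instead of full floats.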

I started this as an experiment in how far Scratch's VM could be pushed, and because the idea of running an LLM inside Scratch felt absurd and fun. The main challenges were fitting quantized weights into list memory, working around JS call stack limits, and patching llvm2scratch to support additional IR patterns emitted by clang -O2.
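
The post does not say how the call stack limits were worked around. One generic technique for avoiding deep native call nesting, shown here purely as an illustration and not as the author's method, is to replace recursion with a loop that manages its own stack. The `tree_sum` function and the `(value, children)` tree shape are hypothetical.

```python
def tree_sum(root):
    """Sum a (value, children) tree using an explicit work stack.

    The pending work lives in an ordinary list instead of in nested
    function calls, so depth is bounded by memory, not by the host
    language's call stack.
    """
    total = 0
    stack = [root]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)
    return total


# Build a chain 10,000 levels deep, far beyond Python's default
# recursion limit, then sum it without recursing.
node = (1, [])
for _ in range(9_999):
    node = (1, [node])
deep_total = tree_sum(node)
```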

Generates ~1 token every 10 seconds.

Live demo: https://scratch.mit.edu/projects/1277883263

Source: https://github.com/broyojo/llm_from_scratch