GitHub Repository
8 stars · Python

Running an LLM Inside Scratch

by broyojo·Feb 12, 2026·1 point·0 comments

AI Analysis

●●● Banger · Wizardry · Rabbit Hole · Bold Bet

LLM inference inside Scratch at 1 token per 10 seconds — absurd, intentional, and it works.

Strengths
  • Genuine compile-to-Scratch pipeline transforms C inference into valid blocks, not an API wrapper.
  • Clever memory packing: quantizes weights to Q8_0 and maps the entire model into a single Scratch list with fixed addresses.
  • Working live demo on MIT Scratch proves feasibility; streaming token generation in a sprite's speech bubble.
Weaknesses
  • At 1 token every 10 seconds, it is a tech demonstration rather than a usable tool.
  • Extremely narrow audience: requires understanding Scratch VM, llvm2scratch, and llama2.c simultaneously.
Target Audience

Scratch/TurboWarp users, demoscene programmers, constraint-coding enthusiasts

Similar To

llama2.c · TurboWarp (VM fork) · Constraint-coding demoscene projects

Post Description

This runs the smallest llama2.c checkpoint (stories260K) inside Scratch/TurboWarp by compiling C inference code into Scratch blocks using llvm2scratch. The model is quantized to Q8_0 and packed into Scratch lists. If everything works, the sprite streams "Once upon a time..." token-by-token into its speech bubble.
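For readers unfamiliar with Q8_0, here is a minimal sketch of group-wise symmetric 8-bit quantization in that style: each group of weights shares one float scale, and each weight becomes an integer in [-127, 127]. The function names and the group size of 32 are assumptions for illustration, not taken from the project's source.

```python
def quantize_q8_0(weights, group_size=32):
    """Quantize floats to int8-range values with one scale per group.

    group_size=32 is an assumed value for this sketch.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            quantized.append(max(-127, min(127, q)))  # clamp to the int8 range used
    return quantized, scales


def dequantize_q8_0(quantized, scales, group_size=32):
    """Recover approximate floats: value * scale of the group it belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]
```

The round trip loses at most half a quantization step per weight, so the inference code can work on the integers and apply the per-group scale afterwards.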

I started this as an experiment in how far Scratch's VM could be pushed, and because the idea of running an LLM inside Scratch felt absurd and fun. The main challenges were fitting quantized weights into list memory, working around JS call stack limits, and patching llvm2scratch to support additional IR patterns emitted by clang -O2.
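One way to picture fitting weights into list memory is a flat list where each tensor starts at a fixed, precomputed address. The sketch below is an illustration under stated assumptions, not the project's actual layout: the tensor names and helper functions are hypothetical, and the only Scratch-specific detail it relies on is that Scratch lists are 1-indexed.

```python
def pack_into_list(tensors):
    """Lay out named tensors back to back in one flat list.

    Returns the flat list and a table of 1-based base addresses,
    mirroring how a Scratch list would be indexed.
    """
    memory = []
    addresses = {}
    for name, values in tensors.items():
        addresses[name] = len(memory) + 1  # 1-based, like a Scratch list
        memory.extend(values)
    return memory, addresses


def load(memory, base, offset):
    """Read the element at a 1-based base address plus a 0-based offset."""
    return memory[base - 1 + offset]


# Hypothetical tensor names, for illustration only.
tensors = {"token_embedding": [10, 20, 30], "wq": [40, 50]}
memory, addresses = pack_into_list(tensors)
```

Because the addresses are fixed once the layout is built, compiled code can refer to them as constants instead of looking tensors up by name at run time.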

Generates ~1 token every 10 seconds.

Live demo: https://scratch.mit.edu/projects/1277883263

Source: https://github.com/broyojo/llm_from_scratch

Similar Projects

AI/ML · ●●● Banger

Glq LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw with fused CUDA kernel skipping weight materialization.

Wizardry · Big Brain
acd
2013d ago