MinLlama – Llama 3.2 inference in ~100 lines of NumPy

Author: timothygao

by timothygao·Jun 23, 2026·1 point·0 comments

AI Analysis

Solid · Cozy · Niche Gem

Pure NumPy Llama 3.2 inference in about 100 lines, aimed at hacking on KV cache compression.

Strengths

• Pure NumPy implementation makes every tensor operation visible and hackable for research.
• Includes PyTorch and JAX variants with statically shaped KV cache streaming.
• Clear setup scripts using uv for dependency management across all backends.
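
To illustrate what a "statically shaped" KV cache means, here is a minimal NumPy sketch. It is not taken from the project: the `StaticKVCache` class and `attend` function are hypothetical names, and the layout `(n_heads, max_len, head_dim)` is an assumption. The idea is to preallocate the buffers once, write new keys and values into the next free slots, and mask the unfilled slots during attention.

```python
import numpy as np


class StaticKVCache:
    """Preallocated key/value buffers of fixed shape (hypothetical helper)."""

    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        # Assumed layout: (n_heads, max_len, head_dim). The shape never changes.
        self.k = np.zeros((n_heads, max_len, head_dim), dtype=dtype)
        self.v = np.zeros((n_heads, max_len, head_dim), dtype=dtype)
        self.pos = 0  # number of positions filled so far

    def append(self, k_new, v_new):
        # k_new, v_new: (n_heads, t, head_dim) for t new tokens.
        t = k_new.shape[1]
        if self.pos + t > self.k.shape[1]:
            raise ValueError("KV cache is full")
        self.k[:, self.pos:self.pos + t] = k_new
        self.v[:, self.pos:self.pos + t] = v_new
        self.pos += t


def attend(q, cache):
    """Attention for one new token, q: (n_heads, head_dim).

    Scores are computed against the whole fixed-size buffer, then the
    slots that have not been written yet are masked out. At least one
    position must have been appended before calling this.
    """
    head_dim = q.shape[-1]
    scores = np.einsum("hd,htd->ht", q, cache.k) / np.sqrt(head_dim)
    valid = np.arange(cache.k.shape[1]) < cache.pos
    scores = np.where(valid[None, :], scores, -np.inf)
    # Numerically stable softmax over the time axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ht,htd->hd", weights, cache.v)
```

In plain NumPy the fixed shape buys little, but under `jax.jit` or `torch.compile` it avoids recompiling whenever the sequence grows by one token, which is the usual motivation for this layout.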

Weaknesses

• Many minimal transformer implementations already exist in projects such as llama2.c and tinygrad.
• Requires downloading the gated Hugging Face weights before running any code locally.

Post Description

I built minLlama because I wanted a Llama implementation that was easy to understand and hack on for KV cache compression research. There are also PyTorch and JAX versions in ~140 lines.
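
For readers unfamiliar with KV cache compression, one of the simplest baselines is eviction: keep a few of the earliest positions plus a window of the most recent ones, and drop everything in between. The sketch below is illustrative only and is not the author's method. The function name `evict_kv`, the `(n_heads, t, head_dim)` layout, and the default sizes are all assumptions.

```python
import numpy as np


def evict_kv(k, v, n_sink=4, window=64):
    """Keep the first n_sink and the last `window` positions (hypothetical helper).

    k, v: arrays of shape (n_heads, t, head_dim). Returns smaller arrays
    once t exceeds n_sink + window, otherwise returns the inputs unchanged.
    """
    t = k.shape[1]
    if t <= n_sink + window:
        return k, v
    # Indices of the retained positions along the time axis.
    keep = np.concatenate([np.arange(n_sink), np.arange(t - window, t)])
    return k[:, keep], v[:, keep]
```

Because every operation is a plain array slice, a policy like this can be swapped in and compared against the uncompressed cache with a few lines of NumPy.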

I'd be interested in feedback from people who have written transformer implementations before. Are there any implementation "tricks" that I'm missing (e.g., a cleaner KV cache for PyTorch/JAX, or RoPE tricks)?
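
As context for the RoPE question, one common trick is to precompute the cosine and sine tables once and slice them by position at decode time. The NumPy sketch below is not from the project. It assumes the "rotate half" pairing (element i with element i + head_dim/2), a base of 500000, and omits the frequency scaling that Llama 3.x checkpoints apply on top. `rope_tables` and `apply_rope` are hypothetical names.

```python
import numpy as np


def rope_tables(max_len, head_dim, base=500000.0):
    """Precompute cos/sin tables of shape (max_len, head_dim)."""
    # One frequency per pair of dimensions (assumed base, no extra scaling).
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(max_len), inv_freq)      # (max_len, head_dim/2)
    angles = np.concatenate([angles, angles], axis=-1)   # (max_len, head_dim)
    return np.cos(angles), np.sin(angles)


def apply_rope(x, cos, sin, start=0):
    """Rotate x of shape (n_heads, t, head_dim), whose first token is at position `start`."""
    t = x.shape[1]
    half = x.shape[-1] // 2
    # "Rotate half": (a, b) -> (-b, a), applied to the two halves of the last axis.
    rotated = np.concatenate([-x[..., half:], x[..., :half]], axis=-1)
    return x * cos[start:start + t] + rotated * sin[start:start + t]
```

During decoding, the same tables serve every layer and every step: a single new token at position `p` only needs `apply_rope(x, cos, sin, start=p)`, so nothing is recomputed per token.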