GitHub Repository

Yet Another Llama 3.2 implementation (in pure numpy)

0 stars · Python

MinLlama – Llama 3.2 inference in ~100 lines of NumPy

by timothygao·Jun 23, 2026·1 point·0 comments

AI Analysis

●● Solid · Cozy · Niche Gem

Pure NumPy Llama 3.2 inference in ~100 lines, built for hacking on KV cache compression.

Strengths
  • Pure NumPy implementation makes every tensor operation visible and hackable for research.
  • Includes PyTorch and JAX variants with statically-shaped KV cache streaming.
  • Clear setup scripts using uv for dependency management across all backends.
Weaknesses
  • Many minimal transformer implementations already exist, such as llama2.c and tinygrad.
  • Requires downloading the gated Hugging Face weights before any code can run locally.
Target Audience

ML researchers, students, hobbyists

Similar To

llama2.c · minGPT · tinygrad

Post Description

I built minLlama because I wanted a Llama implementation that is easy to understand and easy to hack on for KV cache compression research. There are also PyTorch and JAX versions in ~140 lines.
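For readers unfamiliar with the term, here is a minimal sketch of what a statically-shaped KV cache can look like in NumPy. It is not taken from the repository: `StaticKVCache`, `attend`, and all shapes are hypothetical names and assumptions chosen for illustration. The buffers are allocated once at `max_len`, and unwritten slots are masked out instead of sliced away, so array shapes never change between decode steps.

```python
import numpy as np


class StaticKVCache:
    """Preallocated key/value buffers for one attention layer (hypothetical helper)."""

    def __init__(self, max_len, n_kv_heads, head_dim, dtype=np.float32):
        self.k = np.zeros((n_kv_heads, max_len, head_dim), dtype=dtype)
        self.v = np.zeros((n_kv_heads, max_len, head_dim), dtype=dtype)
        self.pos = 0  # number of positions written so far

    def append(self, k_new, v_new):
        """Write k_new / v_new of shape (n_kv_heads, t, head_dim) in place."""
        t = k_new.shape[1]
        if self.pos + t > self.k.shape[1]:
            raise ValueError("KV cache is full")
        self.k[:, self.pos:self.pos + t] = k_new
        self.v[:, self.pos:self.pos + t] = v_new
        self.pos += t


def attend(q, cache):
    """Single-token decode step. q: (n_heads, 1, head_dim). Assumes cache.pos >= 1."""
    # Grouped-query attention: each KV head serves n_heads // n_kv_heads query heads.
    group = q.shape[0] // cache.k.shape[0]
    k = np.repeat(cache.k, group, axis=0)  # (n_heads, max_len, head_dim)
    v = np.repeat(cache.v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (n_heads, 1, max_len)
    # Mask slots that have not been written yet; shapes stay fixed at max_len.
    valid = np.arange(k.shape[1]) < cache.pos
    scores = np.where(valid, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_heads, 1, head_dim)
```

Because the query is always the newest token, no causal mask is needed beyond hiding the empty slots. A compression experiment could, under these assumptions, be as small as editing `cache.k` and `cache.v` in place, or changing which slots `valid` exposes.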

I would be interested in feedback from people who have written transformer implementations before. Are there any implementation "tricks" that I'm missing (e.g., a cleaner KV cache for PyTorch/JAX, or RoPE tricks)?
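One commonly used RoPE trick is to precompute the cosine and sine tables once and index them by position at each decode step. The sketch below illustrates that idea and is not the repository's code: `rope_tables` and `apply_rope` are hypothetical names, the half-split pairing follows the convention used by Hugging Face checkpoints, `theta=500000.0` is assumed as the Llama 3 base, and the Llama 3 long-context frequency scaling is deliberately left out.

```python
import numpy as np


def rope_tables(max_len, head_dim, theta=500000.0):
    """Precompute cos/sin tables of shape (max_len, head_dim // 2)."""
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(max_len), inv_freq)
    return np.cos(angles), np.sin(angles)


def apply_rope(x, cos, sin, start=0):
    """Rotate x of shape (n_heads, t, head_dim) for positions start .. start + t - 1.

    Pairs element i with element i + head_dim // 2 (half-split convention).
    """
    t = x.shape[1]
    c, s = cos[start:start + t], sin[start:start + t]  # (t, head_dim // 2)
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)
```

During decoding, `start` would be the number of tokens already in the cache, so only the new token's row of each table is read. Since the rotation depends only on position, keys can be rotated once before they are cached and never touched again.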
