Back to browse
GitHub Repository

Zero-dependency C99 GPT-2 engine for edge AI. Sub-1M parameter models train on-device in seconds. Organelle Pipeline Architecture (OPA) coordinates specialised micro-models — 91% win rates on 11 logic games with 30K–160K parameters. Composition beats capacity.

106 starsC

Andrej Karpathy's microgpt.py to C99 microgpt.c – 4,600x faster

by Ajay__soni·Feb 17, 2026·40 points·3 comments

AI Analysis

●●●BangerWizardryZero to One

Pure C99 GPT with SIMD beats Python 4,600x; drop two files into any project.

Strengths
  • Full training pipeline (forward, backward, Adam optimizer) in ~1,500 lines; auditable baseline for research
  • INT8 quantization + SIMD auto-vectorization + cross-platform multithreading prove production rigor, not toy code
  • 39 unit tests + 15 benchmarks + character/word tokenizers show serious engineering discipline
Weaknesses
  • Audience is tiny (systems engineers + ML researchers, not practitioners); educational value doesn't make this a market tool
  • Karpathy's microgpt.py already solved the 'understand GPT' problem; this adds optimization, not insight
Category
Target Audience

Systems engineers, embedded developers, ML researchers exploring low-level optimization

Similar To

Karpathy's microgpt.py · llama.c · tinygrad

Post Description

Andrej Karpathy showed us the GPT algorithm. I wanted to see the hardware limit.

The Punchline: I made it go 4,600x faster in pure C code, no dependencies and using a compiler with SIMD auto-vectorisation!!!

Andrej recently released microgpt.py - a brilliant, atomic look at the core of a GPT. As a low-latency developer, I couldn't resist seeing how fast it could go when you get closer to the metal.

So just for funzies, I spent a few hours building microgpt-c, a zero-dependency and pure C99 implementation featuring:

- 4,600x Faster training vs the Python reference (Tested on MacBook Pro M2 Max). On Windows, it is 2,300x faster. - SIMD Auto-vectorisation for high-speed matrix operations. - INT8 Quantisation (reducing weight storage by ~8x). Training is slightly slower, but the storage reduction is significant.

- Zero Dependencies - just pure logic.

The amalgamation image below is just for fun (and to show off the density!), but the GitHub repo contains the fully commented, structured code for anyone who wants to play with on-device AI.

I have started to build something useful, like a simple C code static analyser - I will do a follow-up post.

Everything else is just efficiency... but efficiency is where the magic happens

Similar Projects

AI/ML●●●Banger

MicroGPT-C – C99 GPT for Edge Training and Tiny Model Pipelines

Karpathy's microgpt in C99, proves tiny coordinated models beat single large models on logic.

WizardryBig Brain
Ajay__soni
103mo ago
Education●●Solid

Interactive visualizer for Karpathy's 243-line microGPT

Type a name and you can literally watch characters turn into IDs, 16‑dim embeddings get added with positional encodings, and causal attention matrices animate per head — all matched numerically to Karpathy's 244‑line microGPT. The implementation is pure TypeScript (no ML libs) and includes a helpful scrollable sidebar with the reference math, which makes this an excellent, low‑friction learning tool — more pedagogical deep dive than research innovation.

Rabbit HoleNiche GemEye Candy
Sayyed23
114mo ago