Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU

by arun-prasath·Jun 30, 2026·7 points·0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain · Zero to One

26B model at 124 tok/s on CPU by compressing the output head, not the experts.

Strengths
  • Bandwidth roofline analysis reveals output head is 32% of bytes, experts only 16%
  • Speculative decoding plus top-3 experts achieves 40→124 tok/s with verified outputs
  • Reproducible recipe with dead-ends documented for others to build on
Weaknesses
  • Requires 64 GB of DDR5 RAM, limiting accessibility to high-end desktops
  • Running fewer experts is an approximation, so outputs must be verified
Category
Target Audience

ML engineers, robotics developers, edge AI practitioners

Similar To

llama.cpp · MLX · Off Grid AI

Post Description

I wanted to know how fast a 26B mixture-of-experts model could run on a desktop CPU with no GPU. Got ~40 tok/s single-stream (lossless) and ~124 batched. The surprising part was the byte budget: for this model you compress the output head (32% of per-token bytes), not the experts (16%). The writeup has the bandwidth roofline and the dead-ends; the repo has the reproducible recipe. Happy to answer questions.

Repo: https://github.com/arun-prasath2005/gemma4-cpu-moe
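
To illustrate the byte-budget argument in the post, here is a minimal sketch of the arithmetic, not the author's code. It assumes decoding is purely memory-bandwidth bound, so tokens/sec is at most bandwidth divided by bytes read per token. The 32% and 16% shares come from the post; the 2x compression ratio and the function names are hypothetical, chosen only for illustration.

```python
def roofline_tok_per_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    return bandwidth_gb_s / gb_per_token


def speedup_from_compression(byte_share: float, ratio: float) -> float:
    """Gain in the bandwidth bound when one component, holding `byte_share`
    of the per-token bytes, is shrunk by `ratio` and the rest is unchanged."""
    remaining = (1.0 - byte_share) + byte_share / ratio
    return 1.0 / remaining


HEAD_SHARE = 0.32    # output head share of per-token bytes (from the post)
EXPERT_SHARE = 0.16  # expert share of per-token bytes (from the post)
RATIO = 2.0          # hypothetical compression ratio, for illustration only

head_gain = speedup_from_compression(HEAD_SHARE, RATIO)
expert_gain = speedup_from_compression(EXPERT_SHARE, RATIO)

print(f"compress output head: {head_gain:.2f}x")   # 1.19x
print(f"compress experts:     {expert_gain:.2f}x")  # 1.09x
```

Under these assumptions, the same compression ratio buys roughly twice the gain on the output head as on the experts, which is why the head is the better target for this model.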
