Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU


by arun-prasath · Jun 30, 2026 · 7 points · 0 comments


AI Analysis

●●● Banger · Wizardry · Big Brain · Zero to One

26B model at 124 tok/s on CPU by compressing the output head, not the experts.

Strengths

• Bandwidth roofline analysis reveals that the output head is 32% of per-token bytes, the experts only 16%
• Speculative decoding plus top-3 experts achieves 40→124 tok/s with verified outputs
• Reproducible recipe, with dead-ends documented for others to build on

Weaknesses

• Requires 64 GB of DDR5 RAM, limiting accessibility to high-end desktops
• Running fewer experts is an approximation that requires output verification

Post Description

I wanted to know how fast a 26B mixture-of-experts model could run on a desktop CPU with no GPU. Got ~40 tok/s single-stream (lossless) and ~124 batched. The surprising part was the byte budget: for this model you compress the output head (32% of per-token bytes), not the experts (16%). The writeup has the bandwidth roofline and the dead-ends; the repo has the reproducible recipe. Happy to answer questions.

Repo: https://github.com/arun-prasath2005/gemma4-cpu-moe
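
The byte-budget argument in the post can be sketched as a small calculation. This is only an illustration, not the author's code: it assumes decoding is purely memory-bandwidth bound (every token streams a fixed number of bytes from RAM) and that decompression costs nothing. The 32% and 16% shares come from the post; the function names and the 2x compression factor are hypothetical, chosen for the example.

```python
def roofline_tokens_per_sec(bandwidth_bytes_per_sec: float, bytes_per_token: float) -> float:
    """Upper bound on decode speed when each token must stream bytes_per_token from RAM."""
    if bandwidth_bytes_per_sec <= 0 or bytes_per_token <= 0:
        raise ValueError("bandwidth and bytes per token must be positive")
    return bandwidth_bytes_per_sec / bytes_per_token


def speedup_from_compression(share: float, factor: float) -> float:
    """Bandwidth-bound speedup from shrinking one component of the per-token bytes.

    share  -- fraction of per-token bytes the component accounts for (0..1)
    factor -- how many times smaller the component becomes (>= 1)
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if factor < 1.0:
        raise ValueError("factor must be at least 1")
    # Bytes left per token, as a fraction of the original budget.
    remaining = (1.0 - share) + share / factor
    return 1.0 / remaining


# Shares quoted in the post.
HEAD_SHARE = 0.32
EXPERT_SHARE = 0.16
# Hypothetical compression factor, for illustration only.
FACTOR = 2.0

head_speedup = speedup_from_compression(HEAD_SHARE, FACTOR)
expert_speedup = speedup_from_compression(EXPERT_SHARE, FACTOR)

print(f"compress output head {FACTOR:.0f}x: {head_speedup:.2f}x")  # about 1.19x
print(f"compress experts {FACTOR:.0f}x:     {expert_speedup:.2f}x")  # about 1.09x
```

Under these assumptions, the same compression effort pays off roughly twice as much on the output head as on the experts, which is the point the post makes. `roofline_tokens_per_sec` turns a measured memory bandwidth and a per-token byte count into the corresponding ceiling; no real figures are filled in here because the post does not give them.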