I wrote an LLM inference engine in pure Go – 48 tok/s zero dependencies

by computerex·Mar 7, 2026·2 points·0 comments

AI Analysis

●●● Banger · Zero to One · Wizardry · Big Brain

Pure Go LLM inference with zero dependencies at ~48 tok/s; genuinely novel for the Go ecosystem.

Strengths

• Zero external dependencies (SIMD is optional) and a pure Go implementation dramatically lower deployment friction
• A declarative architecture spec resolved at load time means adding a new model is mostly configuration, not a code rewrite
• Covers 25+ quantization formats, Whisper, and multi-turn chat: serious breadth for a single developer

Weaknesses

• ~48 tok/s on small models is significantly slower than llama.cpp; it won't replace it for latency-critical apps
• Only Apple Silicon and Linux support is mentioned; Windows support is unclear

Post Description

dlgo is a pure Go deep learning inference engine. It loads GGUF models and runs them on CPU with no dependencies beyond the standard library (SIMD acceleration is optional via cgo).
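
To give a feel for what loading a GGUF file with only the standard library involves, here is a minimal sketch of parsing the fixed-size file header. It is not taken from dlgo; the names ggufHeader and parseGGUFHeader are made up for illustration. It assumes the version 2+ layout of GGUF: the 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ggufHeaderSize is the size of the fixed header, assuming GGUF v2+:
// magic (4) + version (4) + tensor count (8) + metadata KV count (8).
const ggufHeaderSize = 24

// ggufHeader is a hypothetical type for this sketch, not dlgo's API.
type ggufHeader struct {
	Version     uint32
	TensorCount uint64
	KVCount     uint64
}

// parseGGUFHeader decodes the fixed-size header from the first bytes of a file.
func parseGGUFHeader(b []byte) (ggufHeader, error) {
	if len(b) < ggufHeaderSize {
		return ggufHeader{}, errors.New("gguf: header too short")
	}
	if string(b[:4]) != "GGUF" {
		return ggufHeader{}, errors.New("gguf: bad magic")
	}
	le := binary.LittleEndian
	return ggufHeader{
		Version:     le.Uint32(b[4:8]),
		TensorCount: le.Uint64(b[8:16]),
		KVCount:     le.Uint64(b[16:24]),
	}, nil
}

func main() {
	// Build a synthetic header in memory instead of reading a real model file.
	buf := make([]byte, ggufHeaderSize)
	copy(buf, "GGUF")
	binary.LittleEndian.PutUint32(buf[4:8], 3)
	binary.LittleEndian.PutUint64(buf[8:16], 2)
	binary.LittleEndian.PutUint64(buf[16:24], 5)

	h, err := parseGGUFHeader(buf)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("version=%d tensors=%d metadata_kv=%d\n", h.Version, h.TensorCount, h.KVCount)
}
```

A real loader would go on to read the metadata key/value pairs and tensor descriptors that follow this header; the sketch stops at the fixed fields.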

I built this because I wanted to add local LLM inference to a Go project without shelling out to Python or linking against llama.cpp. The whole thing is go get github.com/computerex/dlgo and you're running models.

It supports LLaMA, Qwen 2/3/3.5, Gemma 2/3, Phi-2/4, SmolLM2, Mistral, and Whisper speech-to-text. Architectures are expressed as a declarative per-layer spec resolved at load time, so adding a new model family is mostly just describing its layer structure rather than writing a new forward pass.
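
The idea of a declarative per-layer spec can be sketched roughly as follows. This is an illustration of the general pattern, not dlgo's actual types: MixerKind, LayerSpec, ArchSpec, Resolve, and Forward are all hypothetical names, and the "kernels" are toy arithmetic stand-ins for real attention or Gated Delta Network implementations.

```go
package main

import "fmt"

// MixerKind names the token-mixing block a layer uses (hypothetical).
type MixerKind string

const (
	FullAttention MixerKind = "full_attention"
	GatedDeltaNet MixerKind = "gated_delta_net"
)

// LayerSpec describes one layer as data rather than code.
type LayerSpec struct {
	Mixer MixerKind
}

// ArchSpec is the declarative description of a whole model family.
type ArchSpec struct {
	Name   string
	Layers []LayerSpec
}

// kernel is a stand-in for a real layer implementation; it mutates a toy hidden state.
type kernel func(h []float32)

// kernels maps each mixer kind to its implementation. The bodies are placeholders.
var kernels = map[MixerKind]kernel{
	FullAttention: func(h []float32) {
		for i := range h {
			h[i] += 1
		}
	},
	GatedDeltaNet: func(h []float32) {
		for i := range h {
			h[i] *= 2
		}
	},
}

// Resolve turns the spec into an ordered execution plan once, at load time.
func Resolve(spec ArchSpec) ([]kernel, error) {
	plan := make([]kernel, 0, len(spec.Layers))
	for i, l := range spec.Layers {
		k, ok := kernels[l.Mixer]
		if !ok {
			return nil, fmt.Errorf("%s: layer %d: unknown mixer %q", spec.Name, i, l.Mixer)
		}
		plan = append(plan, k)
	}
	return plan, nil
}

// Forward is the single generic loop shared by every architecture.
func Forward(plan []kernel, h []float32) {
	for _, k := range plan {
		k(h)
	}
}

func main() {
	// A made-up hybrid layout: two Gated Delta Network layers, then one attention layer.
	spec := ArchSpec{
		Name:   "hybrid-demo",
		Layers: []LayerSpec{{GatedDeltaNet}, {GatedDeltaNet}, {FullAttention}},
	}
	plan, err := Resolve(spec)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	h := []float32{1}
	Forward(plan, h)
	fmt.Println(h)
}
```

Under this pattern, supporting a new hybrid model means writing a new ArchSpec value; a new kernel is only needed when the model introduces a block type the registry does not yet have.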

Performance on a single CPU thread with Q4_K_M quantization: ~31 tok/s for LLaMA 3.2 1B, ~48 tok/s for Qwen3 0.6B, ~16 tok/s for Qwen3.5 2B (which has a hybrid attention + Gated Delta Network architecture). Not going to beat llama.cpp on raw speed, but it's fast enough to be useful and the ergonomics of a native Go library are hard to beat.

Supports 25+ GGML quantization formats (Q4_0 through Q8_0, all K-quants, I-quants, F16, BF16, F32). The GGUF parser, dequantization, tokenizer, forward pass, and sampling are all implemented from scratch.
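
As an example of what from-scratch dequantization looks like, here is a sketch for Q4_0, the simplest of the listed formats. It follows the GGML block layout as I understand it: each block stores 32 weights as a little-endian fp16 scale followed by 16 bytes of packed 4-bit values, where the low nibbles hold weights 0..15 and the high nibbles hold weights 16..31, and each weight is (nibble - 8) * scale. The function names are hypothetical and this is not dlgo's code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

const (
	qk         = 32         // weights per Q4_0 block
	blockBytes = 2 + qk/2   // fp16 scale + 16 bytes of packed nibbles
)

// f16ToF32 converts an IEEE 754 half-precision value to float32.
func f16ToF32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1F
	man := uint32(h) & 0x3FF
	switch exp {
	case 0: // zero or subnormal: value = man * 2^-24
		f := float32(man) * (1.0 / (1 << 24))
		if sign != 0 {
			return -f
		}
		return f
	case 0x1F: // infinity or NaN
		return math.Float32frombits(sign | 0x7F800000 | man<<13)
	}
	// Normal number: rebias the exponent from 15 to 127.
	return math.Float32frombits(sign | (exp+112)<<23 | man<<13)
}

// dequantQ4_0 expands a run of Q4_0 blocks into float32 weights.
func dequantQ4_0(data []byte) ([]float32, error) {
	if len(data)%blockBytes != 0 {
		return nil, errors.New("q4_0: input is not a whole number of blocks")
	}
	out := make([]float32, len(data)/blockBytes*qk)
	for b := 0; b*blockBytes < len(data); b++ {
		blk := data[b*blockBytes : (b+1)*blockBytes]
		d := f16ToF32(binary.LittleEndian.Uint16(blk[:2]))
		base := b * qk
		for j, q := range blk[2:] {
			out[base+j] = float32(int(q&0x0F)-8) * d      // low nibble
			out[base+j+qk/2] = float32(int(q>>4)-8) * d   // high nibble
		}
	}
	return out, nil
}

func main() {
	// One synthetic block: scale 0.5 (fp16 0x3800), all nibbles 8 (weight 0),
	// except the first byte 0x9F: low nibble 15, high nibble 9.
	blk := make([]byte, blockBytes)
	binary.LittleEndian.PutUint16(blk[:2], 0x3800)
	for i := 2; i < blockBytes; i++ {
		blk[i] = 0x88
	}
	blk[2] = 0x9F

	w, err := dequantQ4_0(blk)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(w[0], w[1], w[16])
}
```

The K-quants and I-quants use more elaborate block layouts (super-blocks, per-sub-block scales, lookup tables), but the overall shape is the same: decode a block header, then unpack fixed-width integers and scale them.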

Code: https://github.com/computerex/dlgo