We built an LLM inference engine in pure Python – no PyTorch, no Triton
30x faster cold start than vLLM with zero PyTorch dependencies.
Golang inference engine and deep learning primitives
Pure Go LLM inference with zero dependencies, at ~48 tok/s on a single CPU thread (Qwen3 0.6B, Q4_K_M); genuinely novel for the Go ecosystem.
Go developers needing local LLM inference without Python/C++ dependencies
llama.cpp · Ollama · tinygrad
I built this because I wanted to add local LLM inference to a Go project without shelling out to Python or linking against llama.cpp. The whole thing is go get github.com/computerex/dlgo and you're running models.
It supports LLaMA, Qwen 2/3/3.5, Gemma 2/3, Phi-2/4, SmolLM2, Mistral, and Whisper speech-to-text. Architectures are expressed as a declarative per-layer spec resolved at load time, so adding a new model family is mostly just describing its layer structure rather than writing a new forward pass.
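To make the "declarative per-layer spec" idea concrete, here is a minimal sketch of what such a spec could look like. It is not taken from dlgo: the type names (LayerKind, LayerSpec, ArchSpec), the HybridSpec and Plan helpers, and the 1-in-4 attention ratio are all hypothetical, chosen only to show how a hybrid model can be described as data and then walked by one generic loop.

```go
package main

import "fmt"

// LayerKind names the token-mixing block a layer runs.
// These names are illustrative, not dlgo's.
type LayerKind int

const (
	FullAttention LayerKind = iota
	GatedDeltaNet
)

// LayerSpec describes a single layer as data instead of code.
type LayerSpec struct {
	Kind    LayerKind
	GatedFF bool // whether the feed-forward block is gated (SwiGLU-style)
}

// ArchSpec is the full model description, resolved once at load time.
type ArchSpec struct {
	Name   string
	Layers []LayerSpec
}

// HybridSpec builds a spec in which every fullEvery-th layer uses full
// attention and the rest use a Gated Delta Network block.
// fullEvery must be positive. The ratio is an assumption for the demo.
func HybridSpec(name string, nLayers, fullEvery int) ArchSpec {
	spec := ArchSpec{Name: name, Layers: make([]LayerSpec, nLayers)}
	for i := range spec.Layers {
		kind := GatedDeltaNet
		if (i+1)%fullEvery == 0 {
			kind = FullAttention
		}
		spec.Layers[i] = LayerSpec{Kind: kind, GatedFF: true}
	}
	return spec
}

// Plan walks the spec and reports which mixer each layer would run.
// A real engine would call the matching kernel here instead of
// returning a name; the point is that one loop serves every model.
func Plan(spec ArchSpec) []string {
	ops := make([]string, len(spec.Layers))
	for i, l := range spec.Layers {
		switch l.Kind {
		case FullAttention:
			ops[i] = "attention"
		case GatedDeltaNet:
			ops[i] = "delta_net"
		}
	}
	return ops
}

func main() {
	spec := HybridSpec("hybrid-demo", 8, 4)
	fmt.Println(spec.Name, Plan(spec))
}
```

With this shape, supporting a new family means writing a new spec constructor, while the loop in Plan (standing in for the forward pass) stays untouched.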
Performance on a single CPU thread with Q4_K_M quantization: ~31 tok/s for LLaMA 3.2 1B, ~48 tok/s for Qwen3 0.6B, ~16 tok/s for Qwen3.5 2B (which has a hybrid attention + Gated Delta Network architecture). Not going to beat llama.cpp on raw speed, but it's fast enough to be useful and the ergonomics of a native Go library are hard to beat.
Supports 25+ GGML quantization formats (Q4_0 through Q8_0, all K-quants, I-quants, F16, BF16, F32). The GGUF parser, dequantization, tokenizer, forward pass, and sampling are all implemented from scratch.
Custom GGUF parser with mmap beats llama.cpp load times, but zero stars means unproven claims.
Zero-trust networking via zrok beats LiteLLM when your GPUs sit behind NAT.
SQLite-based LLM inference hitting 210MB RSS beats OS paging with deterministic memory control.
Simulates governance policies without CUDA kernels or real vLLM schedulers.
Unified API gateway for Ollama + vLLM with real-time GPU telemetry and drain mode.