Trained a 12M transformer on an ML framework we built from scratch
Custom CUDA kernels and a Rust backend with a TypeScript API, built by students in four months.
Mamba SSM and Mamba-3 SISO in Rust with optional CUDA GPU acceleration. Inference and training (BPTT through SSM state, AdamW), CPU + GPU paths, custom CUDA kernels, CUDA Graph capture, f32 / bf16 / f16. Batch-invariant bf16 inference — per-row output is bit-identical across batch sizes.
Custom CUDA kernels for SSM recurrence with zero framework dependencies.
ML engineers who want Rust-based SSM implementations
mamba-minimal · Candle · Burn
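For readers unfamiliar with the SSM recurrence that the description refers to, here is a minimal CPU sketch of a diagonal state-space scan for one channel. It is not the project's kernel. The function name `ssm_scan`, the `[T][N]` layout, and the assumption that `a`, `b`, and `c` are already discretized are all hypothetical choices made for illustration. It implements the plain recurrence h_t = a_t * h_{t-1} + b_t * x_t with readout y_t = sum over n of c_t[n] * h_t[n], and omits Mamba-3's trapezoidal discretization.

```rust
/// Sequential scan of a diagonal SSM for a single channel (illustrative only).
///
/// a, b, c: per-timestep coefficients, shape [T][N] (assumed already discretized)
/// x:       input sequence, shape [T]
/// state:   hidden state, shape [N]; updated in place so a caller can resume later
pub fn ssm_scan(
    a: &[Vec<f32>],
    b: &[Vec<f32>],
    c: &[Vec<f32>],
    x: &[f32],
    state: &mut [f32],
) -> Vec<f32> {
    assert!(a.len() == x.len() && b.len() == x.len() && c.len() == x.len());
    let mut y = Vec::with_capacity(x.len());
    for t in 0..x.len() {
        let mut acc = 0.0f32;
        for n in 0..state.len() {
            // State update: decay the previous state, then inject the input.
            state[n] = a[t][n] * state[n] + b[t][n] * x[t];
            // Readout: project the updated state onto the output.
            acc += c[t][n] * state[n];
        }
        y.push(acc);
    }
    y
}
```

Because the state is updated in place, calling the function on one long sequence or on consecutive chunks with the same `state` buffer yields the same outputs, which is the property that token-by-token inference depends on.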
INT4 inference engine beats llama.cpp on VRAM, but it competes against established tools.
30x faster cold start than vLLM with zero PyTorch dependencies.
E8 lattice codebooks beat GPTQ at 2-4 bpw, with a fused CUDA kernel that skips weight materialization.
Readable Mamba-3 in pure PyTorch solves the trapezoidal discretization cross-boundary dependency without custom kernels.
Builds vLLM from scratch with PagedAttention kernels, even though llama.cpp already exists.