GitHub Repository

Mamba SSM and Mamba-3 SISO in Rust with optional CUDA GPU acceleration. Inference and training (BPTT through SSM state, AdamW), CPU + GPU paths, custom CUDA kernels, CUDA Graph capture, f32 / bf16 / f16. Batch-invariant bf16 inference — per-row output is bit-identical across batch sizes.

13 stars · Rust

Mamba SSM in Rust – training and inference with custom CUDA kernels

by silvermpx·Mar 23, 2026·1 point·0 comments

AI Analysis

●● Solid · Wizardry · Niche Gem

Custom CUDA kernels for SSM recurrence with zero framework dependencies.

Strengths

•Full BPTT through recurrent SSM state enables actual training, not just inference.
•Zero-allocation single-step inference runs in ~200 μs on CPU, with no GPU required.
•Standalone design means no PyTorch, Burn, or Candle dependency chain.
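
To make the "zero-allocation single-step inference" point concrete, here is a minimal sketch of a recurrent state-space step that reuses a preallocated state buffer. It is not the project's code: the `DiagonalSsm` type, its field names, and the fixed (non-selective) diagonal parameters are all assumptions chosen for illustration. Real Mamba layers compute input-dependent parameters and operate on many channels at once.

```rust
/// Hypothetical diagonal SSM cell. Names and layout are illustrative only
/// and do not come from the project.
pub struct DiagonalSsm {
    a: Vec<f32>, // per-state decay factor
    b: Vec<f32>, // input projection
    c: Vec<f32>, // output projection
    h: Vec<f32>, // recurrent state, allocated once in `new`
}

impl DiagonalSsm {
    pub fn new(a: Vec<f32>, b: Vec<f32>, c: Vec<f32>) -> Self {
        assert!(a.len() == b.len() && b.len() == c.len());
        let h = vec![0.0; a.len()];
        Self { a, b, c, h }
    }

    /// One recurrent step: h <- a * h + b * x, then y = sum(c * h).
    /// The state is updated in place, so the step performs no heap allocation.
    pub fn step(&mut self, x: f32) -> f32 {
        let mut y = 0.0f32;
        for i in 0..self.h.len() {
            self.h[i] = self.a[i] * self.h[i] + self.b[i] * x;
            y += self.c[i] * self.h[i];
        }
        y
    }
}
```

Calling `step` once per token keeps the per-token cost proportional to the state size, because all buffers are created up front in `new`.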

Weaknesses

•Mamba implementations already exist in multiple languages, so a Rust port alone is not a differentiator.
•No benchmark comparisons against official Mamba or other ports.

Target Audience

ML engineers wanting Rust-based SSM implementations

Similar To

mamba-minimal · Candle · Burn

Similar Projects

Developer Tools · ●●● Banger

Trained a 12M transformer on an ML framework we built from scratch

Custom CUDA kernels and Rust backend with a TypeScript API built by students in four months.

Wizardry · Big Brain · Ship It

caliandbust

223mo ago

Infrastructure · ●●● Banger

A Rust OS kernel built for LLM inference

Custom OS kernel for inference cuts layer streaming from 1.4s to 42μs.

Wizardry · Zero to One · Bold Bet

Kanchisaw

3028d ago

Developer Tools · ●●● Banger

cuTile Rust: Safe, data-race-free GPU kernels in Rust

Extends Rust's ownership model across the GPU boundary, using tile-based partitioning to produce data-race-free kernels.

Wizardry · Big Brain · Niche Gem

melihelibol

106181mo ago

Infrastructure · ●● Solid

ZSE – Single-file LLM engine with dual INT4 kernels

INT4 inference engine beats llama.cpp on VRAM usage, but it competes against established tools.

Wizardry · Ship It

zyoralabs

104mo ago

AI/ML · ●●● Banger

We built an LLM inference engine in pure Python – no PyTorch, no Triton

30x faster cold start than vLLM with zero PyTorch dependencies.

Wizardry · Big Brain · Zero to One

zyoraclub

201mo ago

AI/ML · ●●● Banger

Glq LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bpw, with a fused CUDA kernel that skips weight materialization.

Wizardry · Big Brain

acd

201mo ago