
Mamba SSM and Mamba-3 SISO in Rust with optional CUDA GPU acceleration. Supports inference and training (BPTT through the SSM state, AdamW optimizer), CPU and GPU paths, custom CUDA kernels, CUDA Graph capture, and f32 / bf16 / f16 precision. bf16 inference is batch-invariant: each row's output is bit-identical regardless of batch size.
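Batch invariance of this kind usually comes from giving each row a reduction order that never depends on how many rows share the batch (for example, never switching to a batch-size-dependent split or tree reduction). The sketch below is a CPU-only f32 illustration under that assumption, not the project's CUDA/bf16 kernels; `matvec_rows` is a hypothetical name.

```rust
/// Hypothetical batched linear layer: y[b][o] = sum_i x[b][i] * w[o][i].
/// Each output is accumulated left to right over `i`, and that order does
/// not depend on `batch`, so a row's result is unaffected by other rows.
fn matvec_rows(x: &[f32], batch: usize, d_in: usize, w: &[f32], d_out: usize) -> Vec<f32> {
    assert_eq!(x.len(), batch * d_in);
    assert_eq!(w.len(), d_out * d_in);
    let mut y = vec![0.0f32; batch * d_out];
    for b in 0..batch {
        let row = &x[b * d_in..(b + 1) * d_in];
        for o in 0..d_out {
            let wrow = &w[o * d_in..(o + 1) * d_in];
            // Fixed sequential accumulation: no batch-dependent regrouping.
            let mut acc = 0.0f32;
            for i in 0..d_in {
                acc += row[i] * wrow[i];
            }
            y[b * d_out + o] = acc;
        }
    }
    y
}

fn main() {
    let (d_in, d_out) = (5, 2);
    let w: Vec<f32> = (0..d_out * d_in).map(|k| 0.1 * k as f32 - 0.37).collect();
    let row0: Vec<f32> = vec![1.5, -2.25, 0.3, 7.0, -0.01];

    // The same row run alone and as the first row of a batch of three.
    let mut batch3 = row0.clone();
    batch3.extend((0..2 * d_in).map(|k| (k as f32).sin()));
    let alone = matvec_rows(&row0, 1, d_in, &w, d_out);
    let batched = matvec_rows(&batch3, 3, d_in, &w, d_out);
    for o in 0..d_out {
        // Compare bit patterns, not approximate equality.
        assert_eq!(alone[o].to_bits(), batched[o].to_bits());
    }

    // Why order matters: f32 addition is not associative. The ulp at 1e8
    // is 8, so 1e8 + 1 rounds back to 1e8.
    let a = 1e8f32;
    assert_ne!((a + 1.0) - a, (a - a) + 1.0);
    println!("row 0 is bit-identical: {:?}", alone);
}
```

A kernel that regrouped the sum differently when the batch grows could change low-order bits; pinning each row's order is what makes the bitwise comparison hold.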

10 stars · Rust

Mamba SSM in Rust – training and inference with custom CUDA kernels

by silvermpx·Mar 23, 2026·1 point·0 comments

AI Analysis

●● Solid · Wizardry · Niche Gem

Custom CUDA kernels for SSM recurrence with zero framework dependencies.

Strengths
  • Full BPTT through recurrent SSM state enables actual training, not just inference.
  • Zero-allocation single-step inference runs in ~200 μs on CPU, with no GPU required.
  • Standalone design means no PyTorch, Burn, or Candle dependency chain.
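The zero-allocation single-step path can be pictured as a recurrent update on a preallocated state buffer. This is a minimal sketch assuming the Mamba-1 selective-scan discretization (Ā = exp(ΔA), B̄x ≈ ΔBx, plus a D skip term); it does not model Mamba-3's SISO variant, and `SsmState` and `step` are hypothetical names, not the crate's API.

```rust
/// Hypothetical recurrent state for one selective SSM layer:
/// `d_inner` channels, each with a `d_state`-dimensional hidden state.
struct SsmState {
    d_inner: usize,
    d_state: usize,
    h: Vec<f32>, // d_inner * d_state, allocated once in `new`
}

impl SsmState {
    fn new(d_inner: usize, d_state: usize) -> Self {
        Self { d_inner, d_state, h: vec![0.0; d_inner * d_state] }
    }

    /// One decoding step; writes into `y` and performs no heap allocation.
    /// a:     d_inner * d_state decay rates (negative, e.g. -exp(A_log))
    /// delta: d_inner per-channel step sizes (assumed already softplus'd)
    /// b, c:  d_state input-dependent projections for this token
    /// d:     d_inner skip-connection weights
    fn step(&mut self, x: &[f32], delta: &[f32], a: &[f32], b: &[f32], c: &[f32], d: &[f32], y: &mut [f32]) {
        debug_assert_eq!(y.len(), self.d_inner);
        let n = self.d_state;
        for ch in 0..self.d_inner {
            let h = &mut self.h[ch * n..(ch + 1) * n];
            let mut acc = 0.0f32;
            for s in 0..n {
                // Discretize: A_bar = exp(delta * A); B_bar * x ~= delta * B * x.
                let a_bar = (delta[ch] * a[ch * n + s]).exp();
                h[s] = a_bar * h[s] + delta[ch] * b[s] * x[ch];
                acc += c[s] * h[s]; // readout y = C . h
            }
            y[ch] = acc + d[ch] * x[ch];
        }
    }
}

fn main() {
    let (d_inner, d_state) = (2, 3);
    let mut st = SsmState::new(d_inner, d_state);
    let a = vec![-1.0f32; d_inner * d_state];
    let b = [1.0f32, 0.0, 0.0];
    let c = [1.0f32, 0.0, 0.0];
    let d = [0.0f32; 2];
    let delta = [0.5f32; 2];
    let mut y = [0.0f32; 2]; // output buffer reused across tokens
    for t in 0..3 {
        let x = [1.0f32, 2.0];
        st.step(&x, &delta, &a, &b, &c, &d, &mut y);
        println!("t={t} y={y:?}");
    }
}
```

All buffers are sized once up front, so the per-token loop only reads its inputs and overwrites `h` and `y` in place.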
Weaknesses
  • Mamba implementations already exist in several languages, and this is not the only Rust port.
  • No benchmark comparisons against official Mamba or other ports.
Target Audience

ML engineers wanting Rust-based SSM implementations

Similar To

mamba-minimal · Candle · Burn

Similar Projects

AI/ML · ●●● Banger

Glq – LLM quantization using E8 lattice

E8 lattice codebooks beat GPTQ at 2-4 bits per weight, and a fused CUDA kernel skips weight materialization.

Wizardry · Big Brain
acd
AI/ML · ●●● Banger

Mamba3-minimal – PyTorch implementation of Mamba-3

A readable Mamba-3 in pure PyTorch that handles the cross-boundary dependency introduced by trapezoidal discretization without custom kernels.

Big Brain · Wizardry · Niche Gem
vikramkarlex