GitHub Repository

49 stars · Rust

Nabla – Pure Rust GPU math engine, 7.5× faster matmul than PyTorch

by fumishiki · Mar 1, 2026 · 1 point · 1 comment

AI Analysis

●● Solid · Wizardry · Big Brain

Pure Rust autodiff + GPU math avoids C++ FFI hell, but matmul claim needs apples-to-apples benchmarks.

Strengths

•Pure Rust implementation eliminates C++ FFI complexity and dependency bloat, enabling true cross-platform CUDA/Vulkan/AMD kernels from one codebase
•loss.backward() API and kernel fusion (fuse!()) with einsum! macros reduce boilerplate compared with hand-rolled CUDA
•Benchmarks were run on a GH200, which is credible hardware, though the TF32 vs FP32 comparison requires scrutiny

Weaknesses

•The 7.5× speedup is precision-qualified (TF32 vs PyTorch's default precision); an apples-to-apples FP32 comparison shows only a 1.6× advantage
•Minimal ecosystem maturity: few GitHub stars (49 per the repository listing above), unclear adoption, and no real-world training examples beyond microbenchmarks
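
To make the precision caveat concrete, here is a minimal CPU-only sketch of why a TF32 result is not directly comparable to an FP32 one. It does not use Nabla or PyTorch. The helper names (to_tf32, matmul, max_abs_diff) are hypothetical, and TF32 is emulated by truncating an f32 to 10 mantissa bits, which is a simplifying assumption since real hardware may round differently.

```rust
/// Reduce an f32 to TF32 precision (10 explicit mantissa bits) by clearing
/// the low 13 of its 23 mantissa bits.
/// Assumption: truncation is used for simplicity; hardware rounding may differ.
fn to_tf32(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_E000)
}

/// Naive row-major n×n matmul. If `tf32_inputs` is true, each input is reduced
/// to TF32 precision before the multiply; accumulation stays in f32.
fn matmul(a: &[f32], b: &[f32], n: usize, tf32_inputs: bool) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                let (x, y) = (a[i * n + k], b[k * n + j]);
                acc += if tf32_inputs { to_tf32(x) * to_tf32(y) } else { x * y };
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Largest absolute element-wise difference between two equally sized slices.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    let n = 64;
    // Deterministic pseudo-data in [0, 1); the values are illustrative only.
    let a: Vec<f32> = (0..n * n).map(|i| ((i * 37 % 101) as f32) / 101.0).collect();
    let b: Vec<f32> = (0..n * n).map(|i| ((i * 53 % 103) as f32) / 103.0).collect();

    let fp32 = matmul(&a, &b, n, false);
    let tf32 = matmul(&a, &b, n, true);
    println!("max |FP32 - TF32| = {:e}", max_abs_diff(&fp32, &tf32));

    // Ratio of the two speedups quoted in the analysis (7.5x and 1.6x).
    println!("factor not explained by FP32-vs-FP32: {:.1}x", 7.5_f64 / 1.6);
}
```

The two runs do the same number of multiply-adds but produce different results, so a speed comparison across them also trades away accuracy. A fair benchmark would pin both libraries to the same precision mode before timing.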

Category

Developer Tools

Target Audience

Rust developers building ML systems, inference pipelines, and numerical computing

Similar To

PyTorch · tch-rs · Candle

Similar Projects

Developer Tools · ●●● Banger

mtp-rs – pure-Rust MTP library, up to 4x faster than libmtp

Pure-Rust MTP library runs up to 4x faster than libmtp, with no C dependencies.

Solve My Problem · Big Brain

vdavid

312mo ago

Developer Tools · ●● Solid

NeuralScript – A pure-Rust AOT compiler

Compile-time tensor shape checking beats PyTorch's runtime dimension errors.

Bold Bet · Big Brain

AkaiNa

501mo ago

Data · ●●●● Gem

Cuckoo-GPU – A 350x faster Bloom filter alternative for GPUs

350x faster GPU Bloom filter with academic paper backing the performance claims.

Wizardry · Big Brain · Dark Horse

tdortman

112mo ago

AI/ML · ●●● Banger

TRELLIS.2 image-to-3D running on Mac Silicon – no Nvidia GPU needed

Runs 4B-parameter image-to-3D on a Mac without CUDA; Microsoft's original required an NVIDIA GPU.

Wizardry · Niche Gem · Zero to One

shivampkumar

202401mo ago

AI/ML · ●●● Banger

I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs

Runs Shampoo-quality second-order optimization on a 16GB T4 where others OOM immediately.

Wizardry · Big Brain

dnosoz

241mo ago

Developer Tools · ●●● Banger

Profine – Profile and rewrite your PyTorch training loop on real GPUs

Automates the painful torch.compile and mixed-precision tuning loop with measured 3x speedups.

Big Brain · Solve My Problem

aisinghal

401mo ago