SparseLab: real sparse training (CSR + custom kernel) in PyTorch, CPU-first
Custom CPU kernels for sparse training when everyone else chases GPU.
Fast Cython/OpenMP-powered 3D volume resampling for NumPy, with PyTorch- and SciPy-compatible nearest, linear, area, cubic, and grid sampling on CPU.
Medical imaging resampling up to 13× faster than PyTorch on CPU (int16 nearest, single thread): genuine performance engineering.
Medical imaging researchers, computational scientists using Python
PyTorch (interpolate/grid_sample) · NumPy · SciPy.ndimage
Benchmarks (Intel i7, 4 cores, PyTorch 2.8.0), volresample vs PyTorch:
resample 512³→256³
  trilinear: 34 ms vs 55 ms (1.6×)
  area mode: 65 ms vs 613 ms (9.5×); PyTorch doesn't parallelize this well
  int16 nearest: 8 ms vs 93 ms (11×); PyTorch has no native int16 path (even 13× on a single thread)
grid_sample 128³: 38 ms vs 169 ms (4.4×)

The main wins come from pre-computed index tables, fused-type specialization (no dtype casting), branchless inner loops, and OpenMP parallelization that actually scales for single-image workloads.
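To make the "pre-computed index tables" idea concrete, here is a minimal pure-Python sketch for one axis of linear resampling. It is not volresample's actual kernel (that is Cython/OpenMP); the function names `build_linear_table` and `resample_row` are hypothetical, and the coordinate mapping assumes the half-pixel convention PyTorch uses with align_corners=False. The point is only that the source indices and weights are computed once per axis and then reused for every row, so the inner loop does no coordinate math.

```python
import math


def build_linear_table(n_in, n_out):
    """Precompute (i0, i1, w) for every output index along one axis.

    Assumption: half-pixel mapping, as in PyTorch's align_corners=False.
    """
    scale = n_in / n_out
    table = []
    for dst in range(n_out):
        # Map the output sample centre back to input coordinates,
        # clamping at the left edge.
        src = max((dst + 0.5) * scale - 0.5, 0.0)
        i0 = min(int(math.floor(src)), n_in - 1)
        i1 = min(i0 + 1, n_in - 1)  # clamp at the right edge
        table.append((i0, i1, src - i0))
    return table


def resample_row(row, table):
    """Apply a precomputed table to one row; only lookups and a blend."""
    return [row[i0] * (1.0 - w) + row[i1] * w for i0, i1, w in table]


# The table is built once and shared by all rows along this axis.
table = build_linear_table(n_in=4, n_out=2)
rows = [[0.0, 1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]]
out = [resample_row(r, table) for r in rows]
print(out)  # [[0.5, 2.5], [15.0, 35.0]]
```

In a 3D volume the same trick applies per axis: three small tables (one each for depth, height, and width) replace per-voxel coordinate computation, which is presumably where a large part of the saving comes from when the volume has millions of voxels.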
No GPU, no autograd, float32-only for interpolation — just fast CPU resampling with a 2-function API.
pip install volresample
GitHub: https://github.com/JoHof/volresample
If you find it interesting, I wrote about the motivation and some implementation details here: https://johof.github.io/2026/02/volresample-3d-volume-resamp...
Pure Rust autodiff + GPU math avoids C++ FFI hell, but matmul claim needs apples-to-apples benchmarks.
Readable Mamba-3 in pure PyTorch solves the trapezoidal discretization cross-boundary dependency without custom kernels.
Matches pyannote on accuracy, runs 8x faster on CPU, no signup—genuine infrastructure win.
350x faster GPU Bloom filter with academic paper backing the performance claims.
CPU-only ONNX transcription when Whisper.cpp already handles this well.