I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs
Runs Shampoo-quality second-order optimization on a 16GB T4 where others OOM immediately.
Built for high-throughput, long-context LLMs: it scales context length with RandNLA and handles very large vocabularies through MAXIS Loss and Fisher-SVD.
Ghost Logit math bypasses 262k vocab OOM without materializing full matrices.
ML researchers, independent AI developers
Liger Kernel · FlashAttention · Axolotl
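Standard exact cross-entropy runs out of memory almost immediately on 16GB GPUs once the vocabulary reaches 262k tokens, because it has to materialize a logit for every token in the vocabulary at every position.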
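For a rough sense of why the dense loss is the bottleneck at a 262k vocabulary, here is a back-of-the-envelope calculation. The numbers are assumptions for illustration, not measurements from the repository: "262k" is read as 2**18, the token count per step is a hypothetical 4,096, and logits are assumed to be upcast to fp32 for the softmax.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int = 4) -> int:
    """Bytes needed for one dense [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_elem


VOCAB = 262_144   # assumption: "262k" taken as 2**18
TOKENS = 4_096    # hypothetical batch_size * seq_len for one step
GIB = 1024 ** 3

fwd = logits_bytes(TOKENS, VOCAB)   # fp32 logits kept for the backward pass
total = 2 * fwd                     # plus a gradient tensor of the same shape

print(f"logits: {fwd / GIB:.1f} GiB, with grad: {total / GIB:.1f} GiB")
# -> logits: 4.0 GiB, with grad: 8.0 GiB
```

Under these assumptions the loss layer alone takes about half of a 16GB card before weights, activations, or optimizer state are counted.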
To bypass this, I implemented MAXIS Loss. It uses a "Ghost Logit" to mathematically simulate the missing probability mass of unsampled tokens, rather than materializing the full 262k-wide matrix.
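To make the general idea concrete, here is a minimal sketch of a sampled cross-entropy with one extra "ghost" logit, for a single position, in plain Python. It is not the MAXIS implementation. The estimator is an assumption: the ghost logit stands in for all unsampled tokens by assuming their average exp-logit matches that of the uniformly sampled negatives. The names `ghost_logit_loss` and `logit_fn` are hypothetical.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ghost_logit_loss(logit_fn, target, vocab_size, num_samples, rng):
    """Approximate cross-entropy for one position without a full logit row.

    logit_fn(token_id) -> float computes a single logit on demand.
    Requires 1 <= num_samples <= vocab_size - 1.
    """
    # Sample negatives uniformly without replacement, skipping the target id.
    picks = rng.sample(range(vocab_size - 1), num_samples)
    negatives = [t if t < target else t + 1 for t in picks]

    target_logit = logit_fn(target)
    neg_logits = [logit_fn(t) for t in negatives]
    logits = [target_logit] + neg_logits

    num_unsampled = vocab_size - 1 - num_samples
    if num_unsampled > 0:
        # Assumed estimator: unsampled mass ~= count * mean(exp(sampled negatives)).
        ghost = (math.log(num_unsampled)
                 + logsumexp(neg_logits)
                 - math.log(num_samples))
        logits.append(ghost)

    # Negative log-likelihood of the target under the approximate softmax.
    return logsumexp(logits) - target_logit


rng = random.Random(0)
# With all-zero logits the ghost estimate is exact, so the loss is log(vocab_size).
loss = ghost_logit_loss(lambda t: 0.0, target=3, vocab_size=1000,
                        num_samples=10, rng=rng)
print(abs(loss - math.log(1000)) < 1e-9)
```

Only `num_samples + 2` logits are ever held per position, so memory scales with the sample count rather than the vocabulary size. The cost is that this version is only an estimate of the true partition function whenever the sampled negatives are not representative of the unsampled ones.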
Benchmarks on a 16GB VRAM card (T4):
17.5x faster in the loss layer compared to the Triton-optimized Liger Kernel.
~39% VRAM reduction in the objective calculation.
The project also includes RandNLA Attention, which uses Causal Kronecker Sketching to keep memory flat as sequence length grows.
I’ve included technical reports with the formal math in the repository. I would love any technical feedback on the partition function simulation or the sketching approach.
One-command benchmark suite comparing Ollama and XGBoost performance with a shared Streamlit dashboard.
Automates the painful torch.compile and mixed-precision tuning loop with measured 3x speedups.
SETI@home for LLMs where agents coordinate hyperparameter searches across volunteer GPUs.
3.11x speedup on minGPT with automated LLM-suggested rewrites.
Estimates LLM training MFU, memory, timeline across 70 models and parallelism strategies—genuinely useful before GPUs commit.