GitHub Repository

High-throughput long-context LLMs. Scaling context via RandNLA and massive vocab capacity through MAXIS Loss and Fisher-SVD.

28 stars · Python

MaximusLLM – Train 262k-vocab LLMs on a single 16GB GPU

by yousef_g · Mar 16, 2026 · 2 points · 0 comments

AI Analysis

●●● Banger · Big Brain · Wizardry · Zero to One

"Ghost Logit" math avoids the 262k-vocab OOM by never materializing the full logit matrix.

Strengths

• MAXIS Loss: reported 17.5x loss-layer speedup over Liger Kernel, with ~39% less VRAM in the objective calculation
• RandNLA Attention keeps memory flat as sequence length grows by sketching
• Native Matryoshka embeddings enable 4x faster RAG retrieval without extra indexing

Weaknesses

• Early stage with 10 stars and limited independent verification of benchmark claims
• Narrow audience: only matters if you're pre-training custom large-vocab models

Post Description

Hi HN, I built this because I wanted to see if I could pre-train large-vocabulary LLMs (like Gemma with 262k tokens) on hardware accessible to independent researchers.

Standard exact Cross-Entropy instantly OOMs on 16GB GPUs at that scale.
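A rough back-of-the-envelope calculation shows why. The sketch below only counts the dense logit matrix that exact cross-entropy materializes. The batch size and sequence length are hypothetical values chosen for illustration, and "262k" is assumed to mean 262,144 entries.

```python
def logits_gib(batch_size, seq_len, vocab_size, bytes_per_value):
    """Memory in GiB for one dense (batch_size * seq_len, vocab_size) logit matrix."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 2**30


VOCAB = 262_144  # assumption: "262k" taken to mean 2**18 entries

# Hypothetical batch of 8 sequences, 2048 tokens each.
print(logits_gib(8, 2048, VOCAB, 4))  # float32 logits -> 16.0
print(logits_gib(8, 2048, VOCAB, 2))  # float16 logits -> 8.0
```

With these numbers the float32 logits alone fill a 16GB card, before counting weights, activations, optimizer state, or the same-shaped gradient that a naive backward pass would also allocate.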

To bypass this, I implemented MAXIS Loss. It uses a "Ghost Logit" to mathematically simulate the missing probability mass of unsampled tokens, rather than materializing the full 262k-wide matrix.
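To make the idea concrete, here is a minimal sketch of one way a single extra logit can stand in for unsampled tokens. It is not taken from the repository: the function names are hypothetical, and the ghost term is assumed to be a plain uniform-sampling estimate of the missing mass, which may differ from the formula in the author's technical reports.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def exact_cross_entropy(logits, target):
    """Reference loss that touches every vocabulary entry."""
    return logsumexp(logits) - logits[target]


def ghost_logit_loss(logits, target, num_samples, rng):
    """Sampled loss with one extra "ghost" logit (illustrative assumption).

    Only the target and `num_samples` uniformly sampled negatives are read.
    The ghost logit estimates the exp-mass of the remaining negatives as
    (unsampled / num_samples) * sum(exp(sampled logits)).
    """
    vocab = len(logits)
    negatives = [i for i in range(vocab) if i != target]
    sampled = rng.sample(negatives, num_samples)
    sampled_logits = [logits[i] for i in sampled]

    terms = [logits[target]] + sampled_logits
    unsampled = vocab - 1 - num_samples
    if unsampled > 0:
        ghost = math.log(unsampled / num_samples) + logsumexp(sampled_logits)
        terms.append(ghost)
    return logsumexp(terms) - logits[target]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(5000)]
print(exact_cross_entropy(logits, target=7))
print(ghost_logit_loss(logits, target=7, num_samples=256, rng=rng))
```

The full `logits` list is passed in here only so the estimate can be compared against the exact loss. A memory-saving implementation would compute just the target and sampled logits from the hidden state, so the normalizer is built from `num_samples + 2` terms instead of the whole vocabulary.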

Benchmarks on a 16GB VRAM card (T4):

• 17.5x faster in the loss layer compared to the Triton-optimized Liger Kernel.
• ~39% VRAM reduction in the objective calculation.

The repo also includes RandNLA Attention, which uses Causal Kronecker Sketching to keep memory flat as sequence length grows.
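For readers unfamiliar with sketching, the snippet below illustrates the general RandNLA idea of compressing a growing matrix into a fixed-size random projection. It is a generic, non-causal random-sign sketch for approximating a K^T V product, not the Causal Kronecker Sketching used in the repository. All names are hypothetical.

```python
import random


def sketch_keys_values(keys, values, sketch_rows, seed=0):
    """Stream token rows into two fixed-size sketches S@K and S@V.

    S is an implicit (sketch_rows x n) random-sign matrix scaled by
    1/sqrt(sketch_rows); one column is drawn per token and then discarded,
    so memory depends on sketch_rows, not on the number of tokens n.
    """
    rng = random.Random(seed)
    scale = sketch_rows ** -0.5
    sk = [[0.0] * len(keys[0]) for _ in range(sketch_rows)]
    sv = [[0.0] * len(values[0]) for _ in range(sketch_rows)]
    for k_row, v_row in zip(keys, values):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(sketch_rows)]
        for i, s in enumerate(signs):
            for j, x in enumerate(k_row):
                sk[i][j] += scale * s * x
            for j, x in enumerate(v_row):
                sv[i][j] += scale * s * x
    return sk, sv


def approx_ktv(sk, sv):
    """Unbiased estimate of K^T V computed as (S@K)^T (S@V)."""
    return [
        [sum(sk[i][a] * sv[i][b] for i in range(len(sk))) for b in range(len(sv[0]))]
        for a in range(len(sk[0]))
    ]


data_rng = random.Random(1)
for n_tokens in (10, 1000):
    keys = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n_tokens)]
    values = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n_tokens)]
    sk, sv = sketch_keys_values(keys, values, sketch_rows=8)
    print(n_tokens, len(sk), len(sk[0]))  # sketch stays 8 x 4 either way
```

The sketch size is a fixed budget, so the accuracy of the estimate depends on `sketch_rows` rather than on sequence length. How the repository makes this causal and applies Kronecker structure is covered in its technical reports.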

I’ve included technical reports with the formal math in the repository. I would love any technical feedback on the partition function simulation or the sketching approach.