MaximusLLM: Breaking the transformer's O(N^2) and O(V) scaling bottlenecks
Claims 17.5x training speedup with Matryoshka embeddings for native RAG.
A Geometric Attention Transformer with the E8 Root System: Sovereign-Lila-E8 (Lie Lattice Attention Language Model)
E8 lattice geometry is built into the attention weights: clever math, but the 0.37 TinyStories loss needs context.
ML researchers interested in attention mechanisms and geometric deep learning
Mixture of Experts approaches · Efficient attention variants (FlashAttention)
I built Sovereign-Lila-E8 because I wanted to see if we could bypass the 'viscosity' of standard attention mechanisms using higher-dimensional geometry.
Most small models today are just distilled copies of larger ones. LILA-E8 is different: it builds the E8 root system lattice directly into the attention weights. By using the E8 lattice, the densest sphere packing in 8 dimensions, we minimize semantic friction (information loss) in the latent space.
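For readers unfamiliar with the geometry, here is a minimal sketch of the E8 root system itself, using the standard coordinate construction (112 roots with two entries of ±1, plus 128 roots with all entries ±1/2 and an even number of minus signs). The function name `e8_roots` is hypothetical, and the post does not say how LILA-E8 wires these vectors into attention, so this only illustrates the object being referred to, not the model's implementation.

```python
import itertools
import numpy as np

def e8_roots():
    """Return the 240 roots of E8 as a (240, 8) array (hypothetical helper)."""
    roots = []
    # 112 roots: exactly two nonzero coordinates, each +1 or -1
    for i, j in itertools.combinations(range(8), 2):
        for si, sj in itertools.product((1.0, -1.0), repeat=2):
            v = np.zeros(8)
            v[i], v[j] = si, sj
            roots.append(v)
    # 128 roots: every coordinate is +1/2 or -1/2, with an even number of minus signs
    for signs in itertools.product((0.5, -0.5), repeat=8):
        if sum(s < 0 for s in signs) % 2 == 0:
            roots.append(np.array(signs))
    return np.stack(roots)

roots = e8_roots()
gram = roots @ roots.T  # pairwise inner products between roots

print(roots.shape)  # (240, 8)
# Every root has squared length 2
assert np.allclose((roots ** 2).sum(axis=1), 2.0)
# Inner products between roots are integers between -2 and 2
assert set(np.round(gram).astype(int).ravel().tolist()) == {-2, -1, 0, 1, 2}
```

The 240 roots are the shortest nonzero vectors of the E8 lattice, which is why the count matches the 8-dimensional kissing number.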
The Results:
Efficiency: 40M parameters reaching 0.37 train / 0.44 val loss on the TinyStories dataset (outperforming standard 60M baselines).
Stability: Sustained coherence for 1000+ tokens without the semantic looping common in small-scale transformers.
By implementing the E8 exceptional Lie algebra directly into the attention weights, the model reached what I call "Geometric Resonance" at 200,000 steps: a phase shift in quality that typically requires 2-3x more parameters in standard architectures.
I've provided a 1-click Google Colab for instant verification of the weights and generation quality.
GitHub: https://github.com/SPUTNIKAI/sovereign-lila-e8
Colab: https://colab.research.google.com/github/SPUTNIKAI/sovereign...
Zenodo (preprint): https://zenodo.org/records/18731736
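To put the reported loss figures in more familiar terms, the sketch below converts them to perplexity. It assumes the numbers are mean cross-entropy in nats per token, which the post does not state; the helper name `perplexity` is hypothetical. Perplexity also depends on the tokenizer, so it is only comparable between models that share one.

```python
import math

def perplexity(loss_nats: float) -> float:
    """Convert mean cross-entropy (assumed to be in nats per token) to perplexity."""
    return math.exp(loss_nats)

train_ppl = perplexity(0.37)  # reported train loss
val_ppl = perplexity(0.44)    # reported val loss

print(f"train perplexity ~ {train_ppl:.2f}")  # train perplexity ~ 1.45
print(f"val perplexity ~ {val_ppl:.2f}")      # val perplexity ~ 1.55
```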
Looking for feedback on expanding the context window to 4096 tokens and potentially porting this to the 24D Leech lattice (see also https://zenodo.org/records/18729723).
Claims 17.5x training speedup with Matryoshka embeddings for native RAG.
Ghost Logit math bypasses 262k vocab OOM without materializing full matrices.
Train a working LLM in 5 minutes on free Colab with a fish personality.
E8 lattice codebooks beat GPTQ at 2-4 bpw with fused CUDA kernel skipping weight materialization.
Fused int4 attention kernel on Metal keeps LLM speed constant as context grows.
Runs a 1.7B LLM offline on Apple Watch using 1-bit quantization.