GitHub Repository

E8 lattice codebook quantization for LLM weights — 2/3/4 bpw with fused Triton inference kernel

5 stars · Python

Glq LLM quantization using E8 lattice

by acd·Jun 1, 2026·2 points·0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain

E8 lattice codebooks beat GPTQ at 2-4 bpw, with a fused Triton kernel that skips weight materialization.

Strengths
  • E8 lattice geometry enables near-optimal Euclidean search after Hadamard decorrelation.
  • Fused Triton kernel matmuls against compressed indices without dequantizing weights.
  • Six pre-quantized models on HuggingFace including 24B Devstral and 30B Nemotron.
Weaknesses
  • Only 3 GitHub stars despite claiming better quality than established quantization methods.
  • CPU fallback uses naive dequantize-then-matmul, limiting accessibility without CUDA.
Category
Target Audience

ML engineers running LLMs on memory-constrained GPUs

Similar To

GPTQ · QuIP# · AWQ

Post Description

With the help of AI, I have created an open source E8 codebook quantization library for LLMs called glq. I wanted to build glq as a PC gamer and DevOps engineer interested in both LLMs and AI. The current high RAM prices and LLM resource usage also inspired me to write it. The question is: can you squeeze more out of a gaming GPU with limited VRAM by using alternative LLM compression methods?

Glq is effective compared to other LLM quantization algorithms in the range from 2 to 4 bits per weight. Its effectiveness at low bits per weight comes from the properties of the E8 lattice compared to linear methods. Glq also supports mixed-precision quantization, where different LLM layers use different bit widths depending on how sensitive each layer is to quantization. Think of mixed precision as a bit like MP3 or MP4 variable bit rate encoding.
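To make the variable bit rate analogy concrete, here is a minimal sketch of how an overall figure such as 3.5 bpw can come out of mixing layers at different bit widths. The layer sizes and bit assignments are hypothetical, `effective_bpw` is an illustrative helper and not part of glq, and the calculation ignores any overhead for codebooks or scales.

```python
def effective_bpw(layers):
    """Average bits per weight over (num_weights, bits_per_weight) pairs."""
    total_weights = sum(n for n, _ in layers)
    total_bits = sum(n * bits for n, bits in layers)
    return total_bits / total_weights


# Hypothetical layers, for illustration only.
layers = [
    (4_000_000, 4),  # a sensitive layer kept at 4 bpw
    (4_000_000, 3),  # a more robust layer squeezed to 3 bpw
]

print(effective_bpw(layers))  # 3.5
```

With equal-sized layers the result is the plain average of the bit widths. With unequal sizes, the larger layers dominate the average.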

I currently develop glq on g7e AWS spot instances to keep the cost more reasonable.

Glq uses vLLM.

The 4-bit E8 key-value cache was inspired by NexusQuant. I try to squeeze about four times as much key-value cache into VRAM as would normally fit with BF16, or about two times as much as with INT8.
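The four-times and two-times figures follow directly from the element width (16, 8 and 4 bits). A rough sketch of the arithmetic, assuming a hypothetical model shape and ignoring any per-block scale or metadata overhead that a real 4-bit cache would add:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Approximate KV cache size in bytes for one sequence."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * tokens * layers * kv_heads * head_dim * bits // 8


# Hypothetical model shape, for illustration only.
shape = dict(tokens=8192, layers=32, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(bits=16, **shape)
int8 = kv_cache_bytes(bits=8, **shape)
four_bit = kv_cache_bytes(bits=4, **shape)

print(bf16 // 2**20, int8 // 2**20, four_bit // 2**20)  # sizes in MiB
print(bf16 // four_bit, int8 // four_bit)               # 4 2
```

In the same VRAM budget, a 4-bit cache therefore holds roughly four times the tokens of BF16 and twice those of INT8.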

At the start I mistakenly picked an E8 codebook size of 65536 entries instead of 4096 entries, which fits better in GPU L1 cache. It turns out that having 65536 codebook entries leads to a higher LLM compression rate, but at the cost of decode speed. I am trying to compensate by using Nvidia CUDA graphs and optimizing the decode, which is currently work in progress.
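The size gap between the two codebooks can be estimated with a little arithmetic. The sketch below assumes that each codebook entry is an 8-dimensional vector stored as fp16 and that one index encodes one block of 8 weights. The post does not describe glq's actual memory layout, so treat both as assumptions.

```python
import math

DIM = 8  # an E8 codeword covers 8 weights


def index_bpw(entries, dim=DIM):
    """Index bits spent per weight when one index encodes `dim` weights."""
    return math.log2(entries) / dim


def table_bytes(entries, dim=DIM, bytes_per_value=2):
    """Lookup table size, assuming fp16 (2-byte) values per coordinate."""
    return entries * dim * bytes_per_value


for entries in (4096, 65536):
    print(entries, index_bpw(entries), table_bytes(entries) // 1024, "KiB")
```

Under these assumptions the 4096-entry table takes 64 KiB with 12-bit indices (1.5 index bits per weight), while the 65536-entry table takes 1 MiB with 16-bit indices (2 index bits per weight). The sixteen-fold larger table is consistent with the L1 cache concern described above.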

To install glq in a Python virtual environment on Linux with an Nvidia GPU: pip install glq

Python PIP package https://pypi.org/project/glq/

Glq source code. https://github.com/cnygaard/glq

Current PC RAM Prices that inspired the library. https://pcpartpicker.com/trends/price/memory/

https://en.wikipedia.org/wiki/E8_lattice The E8 lattice is an eight-dimensional lattice that provides the optimal solution to the sphere packing problem in eight dimensions. Think of it a bit like stacking cannonballs or apples in an optimal way, only you swap the apples for LLM weights.
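For readers who want to see what "snapping weights to the lattice" means, here is a minimal sketch of the classic nearest-point decoder for the infinite E8 lattice (due to Conway and Sloane), using the fact that E8 is the union of D8 and D8 shifted by 1/2 in every coordinate. This is not glq's code: glq searches a finite codebook and applies scaling, neither of which is shown here.

```python
import math


def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [math.floor(v + 0.5) for v in x]  # round each coordinate to the nearest integer
    if sum(r) % 2 != 0:
        # Parity is wrong: take the coordinate with the largest rounding
        # error and round it the other way instead.
        i = max(range(len(x)), key=lambda k: abs(x[k] - r[k]))
        r[i] += 1 if x[i] > r[i] else -1
    return r


def squared_distance(x, p):
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p))


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2), for an 8-dimensional input."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]
    return a if squared_distance(x, a) <= squared_distance(x, b) else b


print(nearest_e8([0.1, 0.2, -0.1, 0.9, 1.1, 0.0, 0.05, -0.2]))
print(nearest_e8([0.4] * 8))
```

The first input snaps to the integer point [0, 0, 0, 1, 1, 0, 0, 0], and the second snaps to the half-integer point with every coordinate equal to 0.5. Both cosets are tried and the closer candidate wins.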

Picture of an E8 lattice https://en.wikipedia.org/wiki/E8_polytope#/media/File:E8_gra...

Credits: GLQ was inspired by E8 QuIP#, and its E8 key-value compression was inspired by NexusQuant.

Math: The sphere packing problem in dimension 8, Maryna Viazovska https://arxiv.org/abs/1603.04246

4bpw glq quantization of Gemma 4 E4B (instruction-tuned) https://huggingface.co/xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw

3.5bpw mixed precision quantization of SmolLM3 https://huggingface.co/xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

Docker image of glq on an Nvidia GPU with the Nvidia container toolkit:

docker run --rm --gpus all \
  -v "$HOME/.cache/huggingface:/cache/hf" \
  ghcr.io/cnygaard/glq-env:0.5.0 \
  python -c '
import glq.hf_integration, torch  # registers GLQ with HF
from transformers import AutoModelForCausalLM, AutoTokenizer
mid = "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, device_map="cuda", torch_dtype=torch.float16)
# Tokenize the prompt and move the tensors to the GPU
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
# Unpack input_ids and attention_mask as keyword arguments
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
'

Work is currently in progress on getting glq's decode speed up and on supporting more LLM architectures.

Open question: does glq work on the Nvidia DGX Spark and on gaming Nvidia hardware such as the 4070 through 5090?
