TurboQuant for vector search – 2-4 bit compression
Data-oblivious quantization beats Product Quantization on online updates.
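As a rough illustration of what "data-oblivious" means here, the sketch below quantizes a vector with a seeded random sign flip, a Hadamard rotation, and a fixed uniform grid. Nothing is trained on the data, so a new vector can be encoded on arrival, whereas Product Quantization needs codebooks learned from a sample. This is a generic sketch under those assumptions, not TurboQuant's actual algorithm. The names `fwht`, `make_signs`, `encode`, and `decode`, and the `clip` parameter, are hypothetical.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def make_signs(dim, seed=0):
    """Random +/-1 signs drawn from a seed only, never from the data."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def encode(vec, signs, bits=4, clip=3.0):
    """Return (norm, codes): the vector norm plus one `bits`-bit integer per coordinate."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    rotated = fwht([s * x / norm for s, x in zip(signs, vec)])
    levels = (1 << bits) - 1
    scale = math.sqrt(len(vec))  # rotated unit-vector coordinates have variance ~1/dim
    codes = []
    for x in rotated:
        t = max(-clip, min(clip, x * scale))  # clamp to the fixed grid range
        codes.append(round((t + clip) / (2 * clip) * levels))
    return norm, codes

def decode(norm, codes, signs, bits=4, clip=3.0):
    """Approximate inverse of encode."""
    levels = (1 << bits) - 1
    scale = math.sqrt(len(codes))
    rotated = [((c / levels) * 2 * clip - clip) / scale for c in codes]
    unrotated = fwht(rotated)  # the orthonormal Hadamard transform is its own inverse
    return [norm * s * x for s, x in zip(signs, unrotated)]

if __name__ == "__main__":
    rng = random.Random(1)
    v = [rng.gauss(0.0, 1.0) for _ in range(64)]
    signs = make_signs(64, seed=0)
    norm, codes = encode(v, signs, bits=4)
    approx = decode(norm, codes, signs, bits=4)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, approx)))
    print("relative error:", err / norm)
```

Because the rotation and the grid depend only on the seed and the bit width, vectors inserted later use exactly the same encoder as the first one.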
Standalone TurboQuant KV Cache Inference for https://huggingface.co/g023/Qwen3-1.77B-g023
Standalone KV cache compression script implementing TurboQuant with a 1.55x compression ratio.
ML researchers, LLM inference engineers
vLLM · TensorRT-LLM · SGLang
- Uses https://huggingface.co/g023/Qwen3-1.77B-g023 as the demonstration model (place the model files in the Qwen3-BEST folder).
Google's ICLR 2026 quantization paper running client-side with SIMD-accelerated dot products.
Single-file C++ ANS kernel beats wrestling with zstandard for quantized data.
Files are single-purpose and readable: each algorithm comes with docstrings, type hints, complexity notes, and runnable examples, so you can read, test, or pip-install individual pieces immediately. It isn't breaking new ground, since algorithm collections are common, but the focus on clarity, tests, and a tiny API surface (merge_sort, BinaryHeap, dijkstra, etc.) makes it a reliable reference and teaching aid.
Custom CUDA kernels for SSM recurrence with zero framework dependencies.
Training-free dual-memory protocol cuts 1792p SigLIP inference from 678ms to 11.9ms.