Digest AI vs HN About

KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

by pythongiant·May 22, 2026·20 points·18 comments

Visit Project View on HN

AI Analysis

●●●BangerBig BrainSolve My ProblemWizardry

Runs 32B models on 8GB VRAM by streaming weights, beating vLLM's memory constraints.

Strengths

•Chunk-level hashing allows 80%+ cache hit rates on multi-turn conversations without custom model code.
•AWQ layer streaming fits Qwen2.5-32B into 5.65GB VRAM, enabling consumer GPU inference for massive models.
•Pure Python implementation with ~10k lines means no C++ compilation hell for contributors.

Weaknesses

•Throughput drops to 0.11 tok/s during streaming mode, making it viable only for latency-tolerant batch jobs.
•Relies entirely on HuggingFace ecosystem; no native ONNX or TensorRT export path mentioned.

Category

Target Audience

ML engineers deploying LLMs on consumer hardware or cost-sensitive infrastructure

Similar To

vLLM · TGI · ExLlamaV2

Similar Projects

Developer Tools●●Solid

1gbps Tokenizer written in Assembly. 20x faster than HuggingFace

Handwritten assembly tokenizer claiming 20x speedup over HuggingFace on SSE2.

WizardryNiche Gem

dogmaticdev

322mo ago

Developer Tools●●●Banger

Warp_cache – SIEVE cache in Rust for Python, 25x faster than cachetools

SIEVE cache beats LRU with one-line swap, but only matters if you're bottlenecked on cache.

WizardryBig Brain

tolopalmer

204mo ago

Developer Tools●●Solid

Ccmd – TUI to audit and clean developer caches

CVE scanning for cached packages beats plain disk cleanup tools.

Solve My ProblemNiche Gem

julsimon

203mo ago

AI/ML●●●Banger

oMLX – SSD-backed KV cache cuts coding agent TTFT from 90s to 1s on Mac

SSD-backed KV cache cuts coding agent TTFT from 90s to 1s, packed in a native macOS app.

WizardrySlick

jundot

414mo ago

AI/ML●●Solid

Nexa-gauge – Cache/cost-aware graph-based eval for LLM and RAG

Cache-aware execution cuts eval costs while tracking grounding and relevance metrics.

Solve My ProblemSlick

Sardhendu

302mo ago

Infrastructure●●Solid

Bytery – a binary JSON protocol ~10x faster and ~10x smaller

Binary JSON with table reuse, but CBOR and MessagePack already own this space.

Big BrainNiche Gem

teamsolid

221mo ago