GitHub Repository

Persistent KV cache with content-hash addressing for tool-augmented LLMs

21 stars · Python

ContextCache – Cache tool schema KV states, skip 99% of prefill tokens

by spranab·Mar 4, 2026·1 point·0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain

KV-cache tool schemas once, reuse across requests: 29.2x speedup on 50 tools, flat 200ms TTFT.

Strengths
  • Genuine inference optimization insight: tool schemas are static; caching them once is obvious in hindsight, novel in practice.
  • Cross-LLM: works with Ollama, Claude, OpenAI, DeepSeek, Groq, xAI—not locked into one vendor.
  • Includes two real products: route-only (~500ms tool detection) and full pipeline (route→extract→execute→synthesize).
Weaknesses
  • Requires self-hosting or managing your own LLM backend; the lack of a managed SaaS option limits its reach to engineers comfortable with ops.
  • GPU benchmarks dominate; a CPU-only path exists but is slower, and it is not proven at scale for high-traffic multi-user systems.
Target Audience

AI engineers building multi-tool LLM agents, inference optimization teams

Post Description

Every tool-calling LLM request resends the full tool schemas through prefill. With 50 tools that's ~6,000 tokens reprocessed on every request, for every user, even though the tools never change.
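To make the cost of resending schemas concrete, here is a rough back-of-the-envelope sketch. The tool count and schema token count come from the post; the query length and request count are assumed values chosen only for illustration.

```python
# Figures from the post, except QUERY_TOKENS and REQUESTS, which are assumptions.
TOOLS = 50
SCHEMA_TOKENS = 6_000   # "~6,000 tokens" for 50 tools
QUERY_TOKENS = 60       # assumption: a short user query
REQUESTS = 1_000        # assumption: an arbitrary request volume


def schema_share(schema_tokens: int, query_tokens: int) -> float:
    """Fraction of each prompt that is tool schema rather than user query."""
    return schema_tokens / (schema_tokens + query_tokens)


def redundant_tokens(schema_tokens: int, requests: int) -> int:
    """Schema tokens prefilled again after the first request has paid for them."""
    return schema_tokens * (requests - 1)


tokens_per_tool = SCHEMA_TOKENS / TOOLS
share = schema_share(SCHEMA_TOKENS, QUERY_TOKENS)
wasted = redundant_tokens(SCHEMA_TOKENS, REQUESTS)

print(f"{tokens_per_tool:.0f} tokens per tool")
print(f"{share:.1%} of each prompt is schema")
print(f"{wasted:,} redundant schema tokens over {REQUESTS:,} requests")
```

Under these assumptions the schemas make up roughly 99% of every prompt, which is the portion a reusable cache would avoid reprocessing.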

ContextCache compiles tool schemas into a KV cache once and reuses it across all requests. Only the user query goes through prefill.
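The idea of compiling once and reusing can be sketched as a cache keyed by a hash of the schema content. This is a minimal illustration and not the project's actual code: `schema_key`, `SchemaKVCache`, and `fake_prefill` are hypothetical names, the store is an in-memory dict rather than a persistent one, and the "KV state" is a placeholder object instead of real attention tensors.

```python
import hashlib
import json
from typing import Any, Callable


def schema_key(tools: list[dict[str, Any]]) -> str:
    """Content hash of the tool schemas.

    Keys inside each schema are sorted, so dict key order does not change
    the hash. The order of tools in the list is kept, since it would change
    token positions in a real prompt.
    """
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SchemaKVCache:
    """Maps a schema hash to an opaque precomputed state (hypothetical)."""

    def __init__(self, compile_fn: Callable[[list[dict[str, Any]]], Any]) -> None:
        self._compile = compile_fn          # stand-in for the expensive prefill pass
        self._store: dict[str, Any] = {}
        self.compiles = 0                   # how many times prefill actually ran

    def get(self, tools: list[dict[str, Any]]) -> Any:
        key = schema_key(tools)
        if key not in self._store:
            self._store[key] = self._compile(tools)
            self.compiles += 1
        return self._store[key]


def fake_prefill(tools: list[dict[str, Any]]) -> dict[str, int]:
    # Placeholder: a real system would run the model over the schema tokens
    # and return its key/value tensors.
    return {"n_tools": len(tools)}


tools = [{"name": "get_weather", "parameters": {"city": "string"}}]
same_tools_reordered = [{"parameters": {"city": "string"}, "name": "get_weather"}]

cache = SchemaKVCache(fake_prefill)
state_a = cache.get(tools)
state_b = cache.get(same_tools_reordered)   # same content, so a cache hit
print(cache.compiles)
```

Keying on content rather than on a name or version string means an edited schema produces a new key automatically, so a stale state is never reused for changed tools.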

Results (Qwen3-8B, RTX 3090 Ti):
- 50 tools: 5,625ms → 193ms (29.2x speedup)
- Zero quality degradation (TSA 0.850 matches full prefill exactly)

Also includes a CPU-only orchestrator (no GPU needed) using llama.cpp + Qwen3.5-2B that routes queries to the right tool in ~550ms. Works with any LLM backend — Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, or self-hosted.

Two products from one project:
- Route-only (~500ms): just tool detection, no LLM needed
- Full pipeline (~3s): route → extract params → execute → synthesize
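The difference between the two modes can be sketched with a toy pipeline. Everything here is hypothetical and far simpler than a real orchestrator: the keyword-based `route`, the regex-based `extract_params`, the canned `get_weather` tool, and the template-based `synthesize` are stand-ins chosen so the example runs without any model or network access.

```python
import re
from typing import Any, Optional


def get_weather(city: str) -> str:
    # Canned result; a real tool would call an external API.
    return f"Sunny in {city}"


# Hypothetical registry: tool name -> (trigger keywords, callable)
TOOLS = {
    "get_weather": ({"weather", "forecast"}, get_weather),
}


def route(query: str) -> Optional[str]:
    """Route-only mode: pick a tool name, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for name, (keywords, _fn) in TOOLS.items():
        if words & keywords:
            return name
    return None


def extract_params(tool: str, query: str) -> dict[str, Any]:
    """Toy extractor: treat the capitalized word after 'in' as the city."""
    match = re.search(r"\bin ([A-Z][a-zA-Z]+)", query)
    return {"city": match.group(1)} if match else {}


def execute(tool: str, params: dict[str, Any]) -> str:
    return TOOLS[tool][1](**params)


def synthesize(query: str, result: str) -> str:
    # A real pipeline would hand the result to an LLM to phrase the reply.
    return f"Answer: {result}"


def run_pipeline(query: str) -> str:
    """Full pipeline: route -> extract params -> execute -> synthesize."""
    tool = route(query)
    if tool is None:
        return "No matching tool."
    params = extract_params(tool, query)
    if not params:
        return "Could not extract parameters."
    return synthesize(query, execute(tool, params))


print(route("What's the weather in Paris?"))
print(run_pipeline("What's the weather in Paris?"))
```

Calling `route` alone corresponds to the route-only mode, while `run_pipeline` chains all four stages and is where the extra latency of the full mode would come from.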

Open source (CC BY 4.0), paper included.

Similar Projects

AI/ML · ●●● Banger

Thaw – Git branch for a running LLM (fork agents, skip prefill)

Git branch for LLM agents — 400x faster forking with preserved KV cache.

Wizardry · Big Brain · Solve My Problem
nilsmatteson
3019d ago