ContextCache – Cache tool schema KV states, skip 99% of prefill tokens

Author: spranab

by spranab·Mar 4, 2026·1 point·0 comments


AI Analysis

●●● Banger · Wizardry · Big Brain

Caches tool-schema KV states once and reuses them across requests: 29.2x speedup with 50 tools, with time to first token (TTFT) flat at about 200ms.

Strengths

•Genuine inference optimization insight: tool schemas are static; caching them once is obvious in hindsight, novel in practice.
•Cross-LLM: works with Ollama, Claude, OpenAI, DeepSeek, Groq, and xAI, so it is not locked into one vendor.
•Includes two real products: route-only (~500ms tool detection) and full pipeline (route→extract→execute→synthesize).

Weaknesses

•Requires self-hosting or managing your own LLM backend; no managed SaaS option limits reach to engineers comfortable with ops.
•GPU benchmarks dominate; a CPU-only path exists but is slower, and it is not proven at scale for high-traffic multi-user systems.

Post Description

Every tool-calling LLM request resends the full tool schemas through prefill. With 50 tools that's ~6,000 tokens reprocessed on every request, for every user, even though the tools never change.

ContextCache compiles tool schemas into a KV cache once and reuses it across all requests. Only the user query goes through prefill.
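The idea can be illustrated with a toy sketch. This is not the project's code: `prefill`, `SchemaPrefixCache`, and the hash-chain "state" are hypothetical stand-ins chosen so the example runs with the standard library alone. A real KV cache holds per-layer key/value tensors, but it shares the property modeled here: the state after the schema prefix does not depend on the query, so it can be computed once and extended per request.

```python
import hashlib
from typing import Dict, List, Tuple

# Toy stand-in for a model's KV state: one digest per processed token.
# The state at position i depends only on tokens 0..i, as in causal attention.
State = Tuple[str, ...]


def prefill(tokens: List[str], state: State = ()) -> State:
    """Extend `state` by processing `tokens` one at a time."""
    out = list(state)
    prev = out[-1] if out else ""
    for tok in tokens:
        prev = hashlib.sha256((prev + "|" + tok).encode()).hexdigest()
        out.append(prev)
    return tuple(out)


class SchemaPrefixCache:
    """Caches the prefill state of a static tool-schema prefix (hypothetical)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, ...], State] = {}
        self.tokens_processed = 0  # prefill work actually performed

    def run(self, schema_tokens: List[str], query_tokens: List[str]) -> State:
        key = tuple(schema_tokens)
        if key not in self._cache:
            # Cold path: pay for the schema prefix once.
            self._cache[key] = prefill(schema_tokens)
            self.tokens_processed += len(schema_tokens)
        # Warm path: only the query goes through prefill.
        self.tokens_processed += len(query_tokens)
        return prefill(query_tokens, self._cache[key])


# Usage: 6,000 placeholder schema tokens, two different user queries.
schema = [f"schema_tok_{i}" for i in range(6000)]
query_a = ["what", "is", "the", "weather"]
query_b = ["book", "a", "flight"]

cache = SchemaPrefixCache()
state_a = cache.run(schema, query_a)
state_b = cache.run(schema, query_b)
```

After both calls, `cache.tokens_processed` is 6,007 rather than the 12,007 that two full prefills would cost, and each resulting state is identical to what a full prefill over schema plus query would produce.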

Results (Qwen3-8B, RTX 3090 Ti):
- 50 tools: 5,625ms → 193ms (29.2x speedup)
- Zero quality degradation (TSA 0.850 matches full prefill exactly)
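As a quick check on how these figures relate to the "skip 99% of prefill tokens" claim in the title, the arithmetic can be worked through directly. The latencies and the ~6,000-token schema size come from the post; the 60-token query length is an assumption made here for illustration only.

```python
# Figures quoted in the post.
full_prefill_ms = 5625
cached_ms = 193
schema_tokens = 6000  # "~6,000 tokens" for 50 tools

# Assumption, not from the post: a typical user query of about 60 tokens.
query_tokens = 60

speedup = full_prefill_ms / cached_ms
skipped = schema_tokens / (schema_tokens + query_tokens)

print(f"speedup from rounded latencies: {speedup:.1f}x")
print(f"share of prefill tokens skipped: {skipped:.1%}")
```

The rounded latencies give roughly 29.1x, marginally below the reported 29.2x; the reported figure was presumably computed from unrounded timings. Under the assumed query length, about 99% of prefill tokens belong to the cached schema prefix.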

Also includes a CPU-only orchestrator (no GPU needed) using llama.cpp + Qwen3.5-2B that routes queries to the right tool in ~550ms. It works with any LLM backend: Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, or self-hosted.

Two products from one project:
- Route-only (~500ms): just tool detection, no LLM needed
- Full pipeline (~3s): route → extract params → execute → synthesize
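The four pipeline stages can be sketched as plain functions. Everything below is hypothetical: the tool names, the keyword router, the regex extractor, and the stub executors are placeholders and do not reflect how ContextCache implements routing, extraction, or synthesis.

```python
import re
from typing import Callable, Dict, Optional, Set

# Hypothetical tool registry; names, keywords, and results are illustrative.
TOOL_KEYWORDS: Dict[str, Set[str]] = {
    "get_weather": {"weather", "forecast", "temperature"},
    "get_time": {"time", "clock"},
}
TOOL_IMPLS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "get_weather": lambda p: f"clear skies in {p['city']}",  # stub executor
    "get_time": lambda p: f"12:00 in {p['city']}",           # stub executor
}


def route(query: str) -> Optional[str]:
    """Stage 1: pick the tool whose keywords overlap the query most."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    best = max(TOOL_KEYWORDS, key=lambda name: len(TOOL_KEYWORDS[name] & words))
    return best if TOOL_KEYWORDS[best] & words else None


def extract_params(query: str) -> Dict[str, str]:
    """Stage 2: pull arguments from the query (a regex stands in for a model)."""
    match = re.search(r"\bin ([A-Z][a-z]+)", query)
    return {"city": match.group(1)} if match else {}


def run_pipeline(query: str) -> str:
    tool = route(query)
    if tool is None:
        return "No matching tool."
    params = extract_params(query)
    if "city" not in params:
        return f"{tool} needs a city."
    result = TOOL_IMPLS[tool](params)  # Stage 3: execute
    return f"[{tool}] {result}"        # Stage 4: synthesize (stub formatting)


answer = run_pipeline("What is the weather in Paris?")
```

The route-only product corresponds to calling `route` alone and handing the chosen tool name back to the caller; the full pipeline adds the three later stages on top.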

Open source (CC BY 4.0), paper included.