Thaw – Git branch for a running LLM (fork agents, skip prefill)
Git branch for LLM agents — 400x faster forking with preserved KV cache.
Persistent KV cache with content-hash addressing for tool-augmented LLMs
KV-cache tool schemas once, reuse across requests: 29.2x speedup on 50 tools, flat 200ms TTFT.
AI engineers building multi-tool LLM agents, inference optimization teams
ContextCache compiles tool schemas into a KV cache once and reuses it across all requests. Only the user query goes through prefill.
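The compile-once, reuse-everywhere idea, combined with the content-hash addressing mentioned above, can be sketched in a few lines. This is only an illustration under stated assumptions, not ContextCache's actual code: `schema_key`, `PrefixCache`, and `fake_prefill` are hypothetical names, and the "prefilled state" is an opaque placeholder where a real system would hold KV tensors.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

Tools = List[Dict[str, Any]]


def schema_key(tools: Tools) -> str:
    """Content hash of a tool list.

    Keys inside each schema are sorted, so dict ordering does not change
    the hash. The order of tools in the list is kept, because it would
    change the token sequence that gets prefilled.
    """
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PrefixCache:
    """Maps a content hash to an opaque prefilled state (hypothetical)."""

    def __init__(self, build: Callable[[Tools], Any]) -> None:
        self._build = build            # the expensive step, e.g. prefill
        self._store: Dict[str, Any] = {}
        self.builds = 0                # how many times the expensive step ran

    def get(self, tools: Tools) -> Any:
        key = schema_key(tools)
        if key not in self._store:     # miss: pay the prefill cost once
            self._store[key] = self._build(tools)
            self.builds += 1
        return self._store[key]        # hit: reuse the stored state


def fake_prefill(tools: Tools) -> Dict[str, int]:
    # Placeholder for running the model over the tool schemas.
    return {"n_tools": len(tools)}


cache = PrefixCache(fake_prefill)
tools = [{"name": "get_weather", "parameters": {"city": "string"}}]
cache.get(tools)   # first request builds the state
cache.get(tools)   # later requests with the same schemas reuse it
```

Because the key depends only on schema content, any request carrying the same tool set maps to the same entry, and editing a schema naturally produces a new key rather than a stale hit.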
Results (Qwen3-8B, RTX 3090 Ti):
- 50 tools: 5,625ms → 193ms (29.2x speedup)
- Zero quality degradation (TSA 0.850 matches full prefill exactly)
Also includes a CPU-only orchestrator (no GPU needed) using llama.cpp + Qwen3.5-2B that routes queries to the right tool in ~550ms. Works with any LLM backend — Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, or self-hosted.
Two products from one project:
- Route-only (~500ms): just tool detection, no LLM needed
- Full pipeline (~3s): route → extract params → execute → synthesize
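The two modes can be pictured as one staged function where route-only stops after the first stage. The sketch below is a rough illustration, not the project's implementation: the `Tool` class, the keyword-overlap router, and the `extract` and `synthesize` callables are all hypothetical stand-ins (the real router and the LLM-backed stages are not described here).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Tool:
    name: str
    keywords: frozenset                    # hypothetical routing hints
    run: Callable[[Dict[str, Any]], Any]   # executes the tool


def route(query: str, tools: List[Tool]) -> Optional[Tool]:
    """Route-only mode: pick a tool without calling an LLM.

    Placeholder logic: choose the tool whose keywords overlap the query
    most, or None if nothing overlaps.
    """
    words = set(query.lower().split())
    best = max(tools, key=lambda t: len(t.keywords & words), default=None)
    if best is None or not (best.keywords & words):
        return None
    return best


def full_pipeline(
    query: str,
    tools: List[Tool],
    extract: Callable[[str, Tool], Dict[str, Any]],
    synthesize: Callable[[str, Optional[Tool], Any], Any],
) -> Any:
    """Full mode: route -> extract params -> execute -> synthesize."""
    tool = route(query, tools)
    if tool is None:
        return synthesize(query, None, None)   # no tool matched
    params = extract(query, tool)              # e.g. an LLM call in practice
    result = tool.run(params)
    return synthesize(query, tool, result)
```

Keeping `route` as a separate function is what lets the cheaper mode ship on its own: callers that only need tool detection never touch the extraction or synthesis stages.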
Open source (CC BY 4.0), paper included.
Keeps agent memory at 8 KB constant size while KV caches bloat to 156 MB.
RSS middleware that skips podcast reruns and paces historical archives automatically.
RDMA-backed distributed KV cache cuts prefill latency 3.1× where vLLM's built-in caching maxes out.
Applies CPU cache coherence protocols to multi-agent LLM synchronization—clever analogy.
Tower-style middleware stacking for inference guardrails beats bolted-on if-statements.