Librarian – Cut token costs by up to 85% for LangGraph and OpenClaw

Author: Pinkert

by Pinkert · Feb 26, 2026 · 8 points · 7 comments


AI Analysis

●●● Banger · Wizardry · Solve My Problem · Big Brain

Async lightweight summarization + temporal-aware selection beats vector RAG for agent context scaling.

Strengths

• Temporal reasoning beats vector embeddings: it understands causality and conversation dependencies, not just semantic similarity, and addresses Lost-in-the-Middle without sacrificing logic
• Benchmarked with reproducible methodology: 82% accuracy vs. 78% for brute-force, up to 85% cost reduction in 50-turn benchmarks, publicly verifiable dataset. Claims have receipts
• Drop-in pip integration for LangGraph/OpenClaw means zero code refactor, and asynchronous indexing means UX never degrades. Shipping maturity is evident

Weaknesses

• Targets a narrow use case (agentic loops with long histories): doesn't help short-context, one-shot LLM calls or RAG workloads where vector search is already optimal
• Depends on two specific frameworks gaining adoption: if the LangGraph ecosystem fragments or OpenClaw dies, the tool becomes niche

Post Description

Hi HN,

I'm building Librarian (https://uselibrarian.dev/), an open-source (MIT) context management tool that stops AI agents from burning tokens by blindly re-reading their entire conversation history on every turn.

The Problem: If you're building agentic loops in frameworks like LangGraph or OpenClaw, you hit two walls fast:

Financial Cost: Cumulative token usage scales quadratically with the number of turns, because every turn resends the full history. Over long conversations that gets incredibly expensive.

Context Rot: As the context window fills up, the LLM suffers from the "Lost in the Middle" effect. Response latency spikes, and reasoning accuracy drops.
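To make the quadratic growth under Financial Cost concrete, here is a small back-of-the-envelope sketch. It assumes every message is the same size (100 tokens, a made-up figure) and that each turn resends all earlier messages; real conversations vary, so treat the numbers as illustrative only.

```python
def tokens_sent_full_history(turns: int, tokens_per_message: int) -> int:
    """Total prompt tokens when every turn resends the whole history so far."""
    # Turn t carries t messages, so the total is m * (1 + 2 + ... + T).
    return sum(t * tokens_per_message for t in range(1, turns + 1))


def closed_form(turns: int, tokens_per_message: int) -> int:
    """Same total via the arithmetic-series formula m * T * (T + 1) / 2."""
    return tokens_per_message * turns * (turns + 1) // 2


if __name__ == "__main__":
    # tokens_per_message=100 is an assumed, uniform message size.
    for turns in (10, 20, 40):
        print(turns, tokens_sent_full_history(turns, 100))
```

Under these assumptions, doubling the number of turns roughly quadruples the total tokens sent, which is what "quadratic" means here.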

The standard workaround is vector search (RAG) over past messages, but similarity-based retrieval discards temporal logic and conversational dependencies.

How Librarian Fixes This: We replaced brute-force context windowing with a lightweight reasoning pipeline:

Index: After each message, a smaller model asynchronously writes a compressed summary (~100 tokens) and adds it to an index of the conversation.

Select: When a new prompt arrives, Librarian reads the summary index and reasons about which specific historical messages are actually relevant to the current turn.

Hydrate: It fetches only those selected messages and passes them to the responder.
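The three stages above can be sketched in a few lines of Python. This is not Librarian's actual API: every name here (`Message`, `ConversationStore`, `summarize`, `select`, `hydrate`) is hypothetical, the summarizer is a simple truncation standing in for the smaller model, and the selector uses keyword overlap standing in for the reasoning step. Indexing also runs inline rather than asynchronously, to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    turn: int
    role: str
    text: str


def _keywords(text: str) -> set[str]:
    # Crude keyword extraction: lowercase words longer than three characters.
    return {w for w in text.lower().split() if len(w) > 3}


def summarize(message: Message, max_words: int = 12) -> str:
    """Stand-in for the small summarizer model: keep the first few words."""
    return " ".join(message.text.split()[:max_words])


def select(index: dict[int, str], prompt: str, limit: int = 3) -> list[int]:
    """Stand-in for the reasoning step: rank turns by keyword overlap."""
    wanted = _keywords(prompt)
    scored = []
    for turn, summary in index.items():
        overlap = len(wanted & _keywords(summary))
        if overlap:
            scored.append((overlap, turn))
    best = sorted(scored, reverse=True)[:limit]
    # Return turns chronologically so the responder sees them in order.
    return sorted(turn for _, turn in best)


@dataclass
class ConversationStore:
    messages: dict[int, Message] = field(default_factory=dict)
    index: dict[int, str] = field(default_factory=dict)

    def add(self, message: Message) -> None:
        """Index: store the full message and its compressed summary."""
        self.messages[message.turn] = message
        self.index[message.turn] = summarize(message)

    def hydrate(self, prompt: str, limit: int = 3) -> list[Message]:
        """Select from the summary index, then fetch the full messages."""
        return [self.messages[t] for t in select(self.index, prompt, limit)]


store = ConversationStore()
store.add(Message(1, "user", "Deploy target is the staging cluster in eu-west"))
store.add(Message(2, "assistant", "Noted, the staging cluster is ready"))
store.add(Message(3, "user", "Unrelated: what is a good name for a cat"))
store.add(Message(4, "user", "The database password rotates every Friday"))

context = store.hydrate("Which cluster is the deploy target", limit=2)
selected_turns = [m.turn for m in context]
```

Only the selected messages would be passed to the responder. The key design point is that selection reads the short summaries, not the full history, so the expensive model never sees the messages that were filtered out.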

The Results: Instead of passing 2,000+ tokens of noise, you pass a highly curated context of ~800 tokens. In our 50-turn benchmarks, this reduces token costs by up to 85% while actually increasing answer accuracy (82% vs 78% for brute-force) because the distracting noise is removed. It currently works as a drop-in integration for LangGraph and OpenClaw.
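The post gives per-turn figures (2,000+ tokens vs. ~800) and a benchmark figure (up to 85%) without spelling out how they relate. The arithmetic below is a quick illustration, not part of the benchmark suite; the function names are made up for this example.

```python
def savings(full_tokens: float, curated_tokens: float) -> float:
    """Fraction of prompt tokens saved by sending the curated context."""
    return 1 - curated_tokens / full_tokens


def full_tokens_for(curated_tokens: float, target_savings: float) -> float:
    """Full-history size at which a curated context yields the target savings."""
    return curated_tokens / (1 - target_savings)


if __name__ == "__main__":
    print(f"{savings(2000, 800):.0%}")          # saving at exactly 2,000 tokens
    print(round(full_tokens_for(800, 0.85)))    # history size needed for 85%
```

At exactly 2,000 tokens of history, an 800-token context saves 60%. An 85% saving with the same 800-token context needs a full history of roughly 5,300 tokens, so the headline figure presumably reflects the later turns of a 50-turn run, where the history is well past 2,000 tokens. That reading is an assumption, not something the post states.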

I'd love for you to check out the benchmark suite, try the integrations, and tear the methodology apart. I'll be hanging out in the comments to answer questions, debug, or hear why this approach is terrible. Thanks!