GitHub Repository

An inference architecture that makes LLMs stateful. Patent pending (US 64/050,345).

12 stars

An agent that remembers across sessions (no chat history)

by wasnaga·Apr 28, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Wizardry

Cuts long-context costs by 90% by swapping expensive GPU recomputation for disk I/O.

Strengths
  • Directly manipulates KV caches to bypass the linear cost scaling of transformer attention.
  • Enables true cross-session memory without the information loss of RAG or summarization.
  • Shifts the bottleneck from expensive HBM to cheap NVMe storage for massive savings.
Weaknesses
  • Requires low-level access to model internals, limiting compatibility with closed API providers.
  • Managing state drift and versioning across different model weights adds operational complexity.
Category
Target Audience

Developers building long-running AI agents or context-heavy applications.

Similar To

Mem0 · LangChain Memory · LlamaIndex

Post Description

Hi HN — I built this in my off-hours over the last 3 months. Sharing now because I just filed the provisional patent yesterday (US 64/050,345) and the repo is freshly public.

The frustration that started it: every time I use a coding agent (Cursor, OpenCode, Aider, Claude Code, etc.), it eventually loses context — forgets the SSH address, re-asks for the DB password, tries to redeploy to localhost when the server is remote. The "proper" answer is "set up 10 specialized agents with short context windows." I'm too lazy for that.

The conventional architecture is the actual problem. Every turn re-sends the full conversation, the model recomputes attention from scratch, and cost compounds with conversation length. Long-running agents are economically infeasible by design.
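To make the compounding concrete, here is a rough back-of-the-envelope sketch in Python. It is an illustration under simplifying assumptions, not a measurement: it counts only prompt tokens, treats each turn as adding a fixed number of new tokens to the history, and ignores any provider-side prompt caching. The function names are hypothetical.

```python
def stateless_prompt_tokens(turn_lengths):
    """Prompt tokens processed when every turn re-sends the whole history."""
    total = 0
    history = 0
    for new_tokens in turn_lengths:
        history += new_tokens   # the conversation grows by this turn's tokens
        total += history        # and the entire history is sent again
    return total


def stateful_prompt_tokens(turn_lengths):
    """Prompt tokens processed when only each turn's new tokens are sent."""
    return sum(turn_lengths)


if __name__ == "__main__":
    turns = [100] * 10  # ten turns, 100 new tokens each (made-up numbers)
    print(stateless_prompt_tokens(turns))  # 5500
    print(stateful_prompt_tokens(turns))   # 1000
```

Under these assumptions the stateless total grows roughly with the square of the number of turns, while the stateful total grows linearly.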

What I built: NLS captures the model's own computed K/V states (and recurrent states for hybrid models like Qwen3.5-MoE) after each turn, persists them to disk, and re-injects them into the cache on the next turn — at the right positions, with proper alignment. The model behaves as if it had the full conversation in context, but the conversation is never re-sent.
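Since the implementation is not public, the following is only a toy Python sketch of the general capture, persist, and re-inject idea, not NLS itself. Everything in it is an assumption for illustration: the JSON file format, the function names (new_state, append_turn, save_state, load_state), and the use of small lists in place of real K/V tensors. It does not touch a real model or handle recurrent states.

```python
import json
import tempfile
from pathlib import Path


def new_state(num_layers):
    """Empty per-session store: one key/value list per layer plus a position counter."""
    return {
        "next_position": 0,  # position index the next new token would receive
        "layers": [{"keys": [], "values": []} for _ in range(num_layers)],
    }


def append_turn(state, turn_kv):
    """Append one turn's K/V entries to every layer, keeping positions aligned.

    turn_kv holds one (keys, values) pair per layer; each list has one entry
    per token processed during the turn.
    """
    if not turn_kv or len(turn_kv) != len(state["layers"]):
        raise ValueError("turn_kv must cover every layer")
    new_tokens = len(turn_kv[0][0])
    for layer, (keys, values) in zip(state["layers"], turn_kv):
        if len(keys) != new_tokens or len(values) != new_tokens:
            raise ValueError("every layer must add the same number of positions")
        layer["keys"].extend(keys)
        layer["values"].extend(values)
    state["next_position"] += new_tokens


def save_state(state, path):
    """Persist the session state to disk (JSON here purely for readability)."""
    Path(path).write_text(json.dumps(state))


def load_state(path):
    """Read a previously saved session state back from disk."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "session.json"

        # Turn 1: two tokens, two layers (toy 1-dimensional vectors).
        state = new_state(num_layers=2)
        append_turn(state, [
            ([[0.1], [0.2]], [[1.0], [2.0]]),
            ([[0.3], [0.4]], [[3.0], [4.0]]),
        ])
        save_state(state, path)

        # Turn 2, possibly in a new process: reload, then add only the new token.
        restored = load_state(path)
        append_turn(restored, [([[0.5]], [[5.0]]), ([[0.6]], [[6.0]])])
        print(restored["next_position"])  # 3
```

The point of the position counter is that tokens from a later turn continue from where the stored state left off, so the restored entries and the new ones line up as one continuous sequence.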

Validated across three settings, in increasing order of stringency:

(1) Standard conversational recall: 5/5 on a 5-fact production test. Baseline check.

(2) LongMemEval (published cross-session benchmark, ~19K sessions). On the 18-question "fully answerable" subset:

Condition                                     Qwen 3.5   Qwen 3.6
Memories provided as TEXT in the prompt       8/18       9/18
Same memories delivered as KV-state via NLS   8/18       9/18

Text and KV produce identical scores. Both fail the same 9-10 questions for the same reasons (multi-hop temporal reasoning that exceeds model capacity). When the architecture's inputs are equivalent, the outputs are equivalent.

(3) Real agentic loop with OpenCode (TUI coding agent, used NLS as its OpenAI-compatible backend). It scaffolded a multi-phase coding project ("ICF Coaching Evaluation Tool"). Then in a separate session, after a full TUI restart with no chat history, I asked "what's the project about?" — it returned a rich, specific description naming the project, the stack, and the architectural decisions. 124 user-typed tokens delivered 18,751 tokens of stored prior-session context. 99.3% prompt-token savings on the recall path. 4/4 recall across the test scenarios.
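The 99.3% figure can be checked with a few lines of Python. This sketch assumes the baseline is a stateless prompt containing both the 124 typed tokens and the 18,751 tokens of stored context; the post does not spell out the exact definition, and the function name is hypothetical.

```python
def prompt_token_savings(sent_tokens, restored_tokens):
    """Fraction of prompt tokens avoided versus re-sending the stored context as text."""
    stateless_prompt = sent_tokens + restored_tokens  # assumed baseline prompt size
    return 1 - sent_tokens / stateless_prompt


if __name__ == "__main__":
    savings = prompt_token_savings(sent_tokens=124, restored_tokens=18_751)
    print(f"{savings:.1%}")  # 99.3%
```

Taking the stored context alone as the baseline (1 - 124/18,751) also rounds to 99.3%, so the reported number holds under either reading.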

Honest caveats:
- The plugin source is proprietary (patent pending). The repo has docs, benchmarks, and the journey, but not the implementation.
- Single-GPU validation. Multi-GPU not tested yet.
- Solo, no team yet.
- Provisional patent only; non-provisional and PCT to follow in the next 12 months.

What I want from this thread: tell me where you'd stress-test it. What workload breaks it? Anyone here from an inference provider — does this overlap with what your stack already does, or is this a new place?

Demo (conversational): https://punkrecords.live
Demo (agentic, OpenAI-compatible): https://api.punkrecords.live/v1

Similar Projects

AI/ML · ●●● Banger

Stateful Inference with 99% Token Savings

Injects raw KV tensors directly into model cache to skip 90% of token recomputation.

Big Brain · Bold Bet
wasnaga
201mo ago
AI/ML · Mid

Residuum | Agentic AI with continuous context

OpenClaw fork with continuous memory instead of RAG — solves a real pain, not groundbreaking.

Ship It
BearFlinn
103mo ago
Developer Tools · ●● Solid

Mimir – Shared memory and inter-agent messaging for Claude Code swarms

Mimir hooks into Claude Code lifecycle events so agents can 'mark' facts (e.g., "API uses snake_case") into a DuckDB-backed memory and RAG pipeline, then auto-injects that context as additionalContext for later agents. It's a pragmatic, well-scoped solution to the annoying problem of agent amnesia — very useful if you run agent swarms, but its impact is limited by Claude Code adoption and the need for the surrounding infra (BGE keys, hooks).

Niche Gem · Ship It
deejaydev
213mo ago