GitHub Repository

An inference architecture that makes LLMs stateful. Patent pending (US 64/050,345).

14 stars

An agent that remembers across sessions (no chat history)

by wasnaga · Apr 28, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Big Brain · Wizardry

Cuts long-context costs by 90% by swapping expensive GPU recomputation for disk I/O.

Strengths

•Directly manipulates KV caches to bypass the linear cost scaling of transformer attention.
•Enables true cross-session memory without the information loss of RAG or summarization.
•Shifts the bottleneck from expensive HBM to cheap NVMe storage for massive savings.

Weaknesses

•Requires low-level access to model internals, limiting compatibility with closed API providers.
•Managing state drift and versioning across different model weights adds operational complexity.

Post Description

Hi HN — I built this in my off-hours over the last 3 months. Sharing now because I just filed the provisional patent yesterday (US 64/050,345) and the repo is freshly public.

The frustration that started it: every time I use a coding agent (Cursor, OpenCode, Aider, Claude Code, etc.), it eventually loses context — forgets the SSH address, re-asks for the DB password, tries to redeploy to localhost when the server is remote. The "proper" answer is "set up 10 specialized agents with short context windows." I'm too lazy for that.

The conventional architecture is the actual problem. Every turn re-sends the full conversation, the model recomputes attention from scratch, and cost compounds with conversation length. Long-running agents are economically infeasible by design.
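To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes every turn adds a fixed number of tokens and ignores model output tokens; the function names and the 200-token, 50-turn figures are illustrative, not measurements from NLS.

```python
def resend_total(turn_tokens):
    """Total prompt tokens processed when every turn re-sends the full history."""
    history = 0
    total = 0
    for tokens in turn_tokens:
        history += tokens   # the conversation grows by this turn's tokens
        total += history    # the whole history is sent and recomputed again
    return total


def stateful_total(turn_tokens):
    """Total prompt tokens processed when only the new tokens are sent each turn."""
    return sum(turn_tokens)


turns = [200] * 50  # hypothetical: 50 turns of 200 tokens each
print(resend_total(turns))    # 255000
print(stateful_total(turns))  # 10000
```

Under these assumptions the re-send total grows roughly with the square of the number of turns, while the stateful total grows linearly.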

What I built: NLS captures the model's own computed K/V states (and recurrent states for hybrid models like Qwen3.5-MoE) after each turn, persists them to disk, and re-injects them into the cache on the next turn — at the right positions, with proper alignment. The model behaves as if it had the full conversation in context, but the conversation is never re-sent.
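The NLS source is not public, so the following is only a toy illustration of the general idea, not the actual implementation. It uses a single attention head in pure Python, with a made-up `embed` function standing in for the model, and shows that key/value vectors saved to disk and reloaded at their original positions give the same attention output as recomputing the whole conversation. All names here (`embed`, `kv_for`, `save_state`, `load_state`) are hypothetical.

```python
import json
import math
import tempfile
from pathlib import Path

DIM = 4  # toy vector width


def embed(token, position):
    # Hypothetical stand-in for the model: deterministic and position-dependent.
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1) + position) for i in range(DIM)]


def kv_for(tokens, start):
    """Toy key/value vectors for tokens at absolute positions start, start+1, ..."""
    keys = [embed(t, start + i) for i, t in enumerate(tokens)]
    values = [[2.0 * x for x in k] for k in keys]
    return keys, values


def attend(query, keys, values):
    """Single-head softmax attention of one query over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(DIM)]


def save_state(path, keys, values):
    path.write_text(json.dumps({"keys": keys, "values": values}))


def load_state(path):
    state = json.loads(path.read_text())
    return state["keys"], state["values"]


session_one = ["deploy", "to", "remote", "server"]
session_two = ["where", "do", "we", "deploy"]

# Session one: compute K/V once, persist to disk, later reload.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "session.json"
    save_state(path, *kv_for(session_one, start=0))
    cached_k, cached_v = load_state(path)

# Session two: only the new tokens are computed, placed after the cached positions.
new_k, new_v = kv_for(session_two, start=len(cached_k))
query = embed(session_two[-1], len(cached_k) + len(session_two) - 1)
from_cache = attend(query, cached_k + new_k, cached_v + new_v)

# Baseline: recompute everything from the full re-sent conversation.
full_k, full_v = kv_for(session_one + session_two, start=0)
from_scratch = attend(query, full_k, full_v)

print(all(math.isclose(a, b) for a, b in zip(from_cache, from_scratch)))
```

The position offset (`start=len(cached_k)`) is the toy analogue of injecting state "at the right positions": if the new tokens were numbered from zero again, the two results would no longer match.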

Validated across three settings, in increasing order of stringency:

(1) Standard conversational recall: 5/5 on a 5-fact production test. Baseline check.

(2) LongMemEval (published cross-session benchmark, ~19K sessions). On the 18-question "fully answerable" subset:

Condition                                      Qwen 3.5   Qwen 3.6
Memories provided as TEXT in the prompt        8/18       9/18
Same memories delivered as KV-state via NLS    8/18       9/18

Text and KV produce identical scores. Both fail the same 9-10 questions for the same reasons (multi-hop temporal reasoning that exceeds model capacity). When the architecture's inputs are equivalent, the outputs are equivalent.

(3) Real agentic loop with OpenCode (a TUI coding agent, using NLS as its OpenAI-compatible backend). It scaffolded a multi-phase coding project ("ICF Coaching Evaluation Tool"). Then, in a separate session after a full TUI restart with no chat history, I asked "what's the project about?" and it returned a rich, specific description naming the project, the stack, and the architectural decisions. 124 user-typed tokens brought back 18,751 tokens of stored prior-session context, a 99.3% prompt-token saving on the recall path. Recall was 4/4 across the test scenarios.
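The savings figure can be checked with a short calculation. The post does not spell out the formula, so this sketch assumes the baseline is re-sending the 18,751 stored tokens as prompt text, compared against the 124 tokens actually typed; the variable names are illustrative.

```python
typed_tokens = 124               # tokens actually sent on the recall turn
stored_context_tokens = 18_751   # prior-session context restored from state

# Assumed definition: fraction of prompt tokens avoided versus re-sending the context.
savings = 1 - typed_tokens / stored_context_tokens
print(f"{savings:.1%}")  # 99.3%
```

Counting the baseline as stored plus typed tokens (18,875) instead gives the same figure to one decimal place.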

Honest caveats:
- The plugin source is proprietary (patent pending). The repo has docs, benchmarks, and the journey, not the implementation.
- Single-GPU validation. Multi-GPU not tested yet.
- Solo, no team yet.
- Provisional patent only; non-provisional and PCT filings are planned in the next 12 months.

What I want from this thread: tell me where you'd stress-test it. What workload breaks it? If you work at an inference provider: does this overlap with what your stack already does, or is this new territory?

Demo (conversational): https://punkrecords.live
Demo (agentic, OpenAI-compatible): https://api.punkrecords.live/v1
