
GitHub Repository

An open benchmark for the failure modes of agent memory systems: retraction, collision, recall, conflict. Offline, zero-dependency, reproducible.

0 stars · TypeScript

A benchmark for the failure modes of agent memory

by Pankhi123·Jun 27, 2026·1 point·0 comments


AI Analysis

●●● Banger · Big Brain · Dark Horse

Shows that retrieval metrics can mislead: answer correctness ranges from 23% to 92% across systems with identical retrieval scores.

Strengths

• Four failure modes expose bugs that make agents confidently wrong, not just retrieval misses
• Zero dependencies and offline execution mean anyone can reproduce the leaderboard in one command
• Baselines show that the typed-constraint approach beats keyword and recency baselines by modeling time and identity
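The gap between retrieval scores and answer correctness can be sketched with a toy retraction case. This is a hypothetical illustration, not code from the benchmark repository: the `MemoryItem` shape, the `retracts` field, and both answer policies are assumptions made for the example. Both policies see the same retrieved set, so retrieval recall is identical, yet only the one that models retraction and time returns the current fact.

```typescript
// Hypothetical illustration; none of these names come from the benchmark itself.
interface MemoryItem {
  id: string;
  subject: string;
  value: string;
  timestamp: number; // larger means newer
  retracts?: string; // id of an earlier item this one retracts (assumed field)
}

const m1: MemoryItem = { id: "m1", subject: "office", value: "Berlin", timestamp: 1 };
const m2: MemoryItem = { id: "m2", subject: "office", value: "Lisbon", timestamp: 2, retracts: "m1" };

// Retrieval recall: fraction of the relevant ids that were retrieved.
function recall(retrieved: MemoryItem[], relevantIds: string[]): number {
  const got = new Set(retrieved.map((m) => m.id));
  return relevantIds.filter((id) => got.has(id)).length / relevantIds.length;
}

// Naive policy: answer with the first retrieved item, ignoring time and retraction.
function answerFirst(retrieved: MemoryItem[]): string {
  const first = retrieved[0];
  if (!first) throw new Error("nothing retrieved");
  return first.value;
}

// Retraction-aware policy: drop retracted items, then answer with the newest survivor.
function answerRetractionAware(retrieved: MemoryItem[]): string {
  const retracted = new Set<string>();
  for (const m of retrieved) {
    if (m.retracts !== undefined) retracted.add(m.retracts);
  }
  let best: MemoryItem | undefined;
  for (const m of retrieved) {
    if (retracted.has(m.id)) continue;
    if (!best || m.timestamp > best.timestamp) best = m;
  }
  if (!best) throw new Error("every retrieved item was retracted");
  return best.value;
}

const retrieved: MemoryItem[] = [m1, m2];
const recallScore = recall(retrieved, ["m1", "m2"]);
const naiveAnswer = answerFirst(retrieved);
const awareAnswer = answerRetractionAware(retrieved);

console.log(recallScore, naiveAnswer, awareAnswer);
```

With the same retrieved set, recall is 1 for both policies, but the naive policy answers with the retracted value while the retraction-aware policy answers with the current one. That is the kind of confidently wrong answer a retrieval-only metric would not surface.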

Weaknesses

• Narrow audience: only matters if you're building agent memory systems
• Benchmark only, not a tool you'd integrate into production workflows

Category

Target Audience

AI agent developers, ML engineers building memory systems, researchers

Similar To

HELM · AgentBench · GAIA

Similar Projects

AI/ML · ● Mid

Named failure modes that stop AI agents from cutting corners

Named failure modes for AI agents, but it's only markdown files, not an actual tool or implementation.

Big Brain · Niche Gem

travisdrake

103mo ago

Security · ●●● Banger

AVE Database – Open taxonomy of 50 failure modes in multi-agent AI systems

First structured CVE-style database for AI agent failures; nobody else is doing this.

Zero to One · Dark Horse · Big Brain

neuravant

103mo ago

AI/ML · ●●● Banger

96.2% on LongMemEval – world record, built solo in 16 days for $1k

World record on LongMemEval beats PwC Chronos, built solo in 16 days.

Wizardry · Big Brain · Dark Horse

JordanMcCann

103mo ago

Security · ●●● Banger

AgentThreatBench – Benchmark for AI Agent Memory Security

First OWASP-backed security layer for ASI06 memory poisoning in agentic AI.

Big Brain · Solve My Problem

vgudur297

2027d ago

Developer Tools · ●● Solid

Resurf – realistic, reproducible test framework for AI browser agents

Synthetic e-commerce site with failure injection beats flaky live-site testing.

Big Brain · Niche Gem

andrew_zhong

501mo ago

AI/ML · ●● Solid

Engram – Persistent Memory API with Drift Detection for AI Agents

Mem0 stores facts, but Engram detects when they go stale and break your agent.

Solve My Problem · Slick

Adam_cipher

102mo ago