GitHub Repository

An open benchmark for the failure modes of agent memory systems: retraction, collision, recall, conflict. Offline, zero-dependency, reproducible.

0 stars · TypeScript

A benchmark for the failure modes of agent memory

by Pankhi123 · Jun 27, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Big Brain · Dark Horse

Shows that retrieval metrics can mislead: answer correctness ranges from 23% to 92% at identical retrieval scores.

Strengths
  • Four failure modes expose bugs that make agents confidently wrong, not just retrieval misses
  • Zero dependencies and offline execution mean anyone can reproduce the leaderboard in one command
  • Baselines reveal that a typed-constraint approach beats keyword and recency baselines by modeling time and identity
Weaknesses
  • Narrow audience: only matters if you're building agent memory systems
  • Benchmark only, not a tool you'd integrate into production workflows
Category
Target Audience

AI agent developers, ML engineers building memory systems, researchers

Similar To

HELM · AgentBench · GAIA

Similar Projects

AI/ML · ●●● Banger

96.2% on LongMemEval – world record, built solo in 16 days for $1k

World record on LongMemEval beats PwC Chronos, built solo in 16 days.

Wizardry · Big Brain · Dark Horse
JordanMcCann
103mo ago