Codex context bloat? 87% avg reduction on SWE-bench Verified traces

by george_ciobanu·Apr 24, 2026·10 points·2 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Transparent proxy cuts Codex prompt tokens by 87% on average using a working-memory approach.

Strengths
  • Replaying 3,807 rounds of SWE-bench Verified traces cut the average prompt from 44k to 6k tokens (-87%).
  • Requires no code changes; the proxy sits between Codex and OpenAI.
  • Open source TypeScript implementation includes replay testing suite.
Weaknesses
  • Tied specifically to Codex, so it may break if the upstream API changes.
  • Working memory logic is opaque, so critical context could be dropped without the user noticing.
Target Audience

Developers using OpenAI Codex for software engineering tasks

Similar To

Cursor · Continue · LangChain Memory

Post Description

If you had to build a context window manager in 24h, would you stick to the existing model or come up with something better?

Here's what I did:

1. Built a proxy that intercepts Codex's calls to OpenAI and rewrites them on the fly.

2. Replayed 3,807 rounds of SWE-bench Verified traces through it: avg prompt 44k → 6k tokens (-87%).

3. Posted it to HN to get the next reduction applied to my confidence interval — starting with the inevitable "How about accuracy?"

npx -y pando-proxy · github.com/human-software-us/pando-proxy
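
The post does not describe how the proxy decides what to keep, so the following is only a minimal sketch of one generic way to shrink a prompt before forwarding it: keep system messages plus the newest turns that fit a token budget. The `Message` shape, `estimateTokens` (a rough 4-characters-per-token guess), and `trimToBudget` are all hypothetical names for illustration and are not taken from pando-proxy. The last helper checks the headline arithmetic.

```typescript
// Hypothetical message shape; real Codex/OpenAI request payloads are richer.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Crude estimate (~4 characters per token). An assumption, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep every system message, then the most recent other messages that fit
// within `budget` estimated tokens. System messages are moved to the front.
function trimToBudget(messages: Message[], budget: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept: Message[] = [];
  // Walk backwards so the newest turns are considered first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m.role === "system") continue;
    const cost = estimateTokens(m.content);
    if (used + cost > budget) break; // stop at the first turn that does not fit
    used += cost;
    kept.unshift(m); // preserve chronological order
  }
  return [...system, ...kept];
}

// Percentage reduction from `before` to `after`, rounded to the nearest integer.
function reductionPercent(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

console.log(reductionPercent(44000, 6000)); // 86
```

Note that the rounded figures (44k and 6k) work out to about 86%, so the reported 87% presumably comes from the unrounded averages. A recency-only cutoff like this one is the simplest possible policy and can discard older context that still matters, which is why the accuracy question raised in the post is relevant.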

Similar Projects

AI/ML · ●●● Banger

97% on SWE-bench Verified with subscription-token agents

97% on SWE-bench Verified with full artifact transparency, not just a score claim.

Big Brain · Zero to One
kimjune01
2019d ago
Developer Tools · ●● Solid

We achieved 72.2% issue resolution on SWE-bench Verified using AI teams

They split responsibilities across isolated agents (engineer, reviewer, manager) that get real shell access and independent filesystems, which makes failures traceable and lets you tune model capacity per role. Hitting 72.2% on SWE-bench Verified with no benchmark-specific tuning is an impressive empirical result — interesting architecture and strong evidence — though the security and long-term reliability of autonomous shell-executing agents remain the big open questions.

Wizardry · Big Brain
NBenkovich
204mo ago