RepoGauge – save token costs and compare agents on your own repos

by siliconc0w · Apr 17, 2026 · 1 point · 0 comments

AI Analysis

●● Solid · Solve My Problem · Big Brain

Mining your own bugfix history for evals beats public benchmarks that don't match your codebase.

Strengths
  • Repo-specific benchmarks from actual bugfix commits beat generic leaderboards
  • Built-in cost telemetry tracks tokens, cache hits, and spend per solver attempt
  • Deterministic hashing keeps regression checks traceable across runs
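The post does not describe how RepoGauge computes its hashes, so the following is only a minimal sketch of the general idea: serialize a task definition in a canonical form and hash it, so the same task always yields the same identifier across runs. The function name `task_fingerprint` and the field names in the example are hypothetical.

```python
import hashlib
import json


def task_fingerprint(task: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable task dict.

    Sorting keys and fixing the separators makes the serialized form
    independent of dict insertion order, so equal tasks hash equally.
    """
    canonical = json.dumps(
        task, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical task fields, chosen for illustration only.
    task_a = {"repo": "example/repo", "base_commit": "abc123", "tests": ["test_x"]}
    task_b = {"tests": ["test_x"], "base_commit": "abc123", "repo": "example/repo"}
    print(task_fingerprint(task_a) == task_fingerprint(task_b))
```

Because the digest depends only on the task's content, a regression check can confirm that two runs evaluated exactly the same task set before their results are compared.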
Weaknesses
  • Author admits it's 'medium-rare'; language support is currently limited
  • Requires significant token budget to run meaningful multi-model comparisons
Category
Target Audience

Engineering teams evaluating AI coding assistants

Similar To

SWE-bench · Aider · Codex CLI

Post Description

I've grown increasingly skeptical that public coding benchmarks tell me much about which model is actually worth paying for, and I worry that as demand continues to spike, model providers will silently degrade performance.

I did a few manual analyses but found it non-trivial to compare across models because of differences in token caching and tool-use efficiency, so I wanted a tool for repeatable evaluations.
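To illustrate why caching complicates the comparison, here is a minimal sketch of per-attempt cost accounting in which cached input tokens are billed at a different rate from uncached ones. It is not RepoGauge's implementation: the `Pricing` and `Usage` types are hypothetical, the prices are made-up placeholders, and it assumes cached tokens are reported as a subset of input tokens, which may not match every provider's usage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens; every value passed in here is a placeholder."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


@dataclass(frozen=True)
class Usage:
    input_tokens: int         # all prompt tokens, cached ones included (assumption)
    cached_input_tokens: int  # the subset of input_tokens served from cache
    output_tokens: int


def attempt_cost(usage: Usage, pricing: Pricing) -> float:
    """Return the USD cost of one attempt, billing cached input separately."""
    if not 0 <= usage.cached_input_tokens <= usage.input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    uncached = usage.input_tokens - usage.cached_input_tokens
    total = (
        uncached * pricing.input_per_mtok
        + usage.cached_input_tokens * pricing.cached_input_per_mtok
        + usage.output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Made-up prices, not any provider's real rate card.
    pricing = Pricing(input_per_mtok=3.0, cached_input_per_mtok=0.3, output_per_mtok=15.0)
    warm = Usage(input_tokens=1_000_000, cached_input_tokens=800_000, output_tokens=100_000)
    cold = Usage(input_tokens=1_000_000, cached_input_tokens=0, output_tokens=100_000)
    print(f"warm cache: ${attempt_cost(warm, pricing):.2f}")
    print(f"cold cache: ${attempt_cost(cold, pricing):.2f}")
```

Two attempts with identical token counts can differ noticeably in cost depending on the cache hit rate, which is one reason raw token totals alone make a poor basis for comparing models.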

So the goal was an OSS tool that gathers data to help answer questions like:

“Would Sonnet have solved most of the issues we gave Opus?” “How much would that have actually saved?” “What about OSS models like Kimi K2.5 or GLM-1?” “The vibes are off; did model performance just regress from last month?”
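The first two questions reduce to simple arithmetic once per-task results exist for both models. The sketch below is an assumption about how such a comparison could be computed, not RepoGauge's code: the `Attempt` record, the `compare` function, and the model labels "big" and "small" are all hypothetical, and the costs are made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    task_id: str
    model: str
    solved: bool
    cost_usd: float


def compare(attempts: list[Attempt], expensive: str, cheap: str) -> dict:
    """Summarize how often `cheap` solves what `expensive` solved, and the savings.

    Only tasks attempted by both models are compared. Savings count only the
    tasks both models solved, since a failed cheap run would need a retry.
    """
    by_key = {(a.task_id, a.model): a for a in attempts}
    tasks = sorted(
        {t for (t, m) in by_key if m == expensive}
        & {t for (t, m) in by_key if m == cheap}
    )
    solved_exp = [t for t in tasks if by_key[(t, expensive)].solved]
    also_cheap = [t for t in solved_exp if by_key[(t, cheap)].solved]
    savings = sum(
        by_key[(t, expensive)].cost_usd - by_key[(t, cheap)].cost_usd
        for t in also_cheap
    )
    return {
        "tasks_compared": len(tasks),
        "solved_by_expensive": len(solved_exp),
        "also_solved_by_cheap": len(also_cheap),
        "coverage": len(also_cheap) / len(solved_exp) if solved_exp else 0.0,
        "savings_usd": savings,
    }


if __name__ == "__main__":
    # Illustrative data only; not results from any real run.
    runs = [
        Attempt("t1", "big", True, 1.00), Attempt("t1", "small", True, 0.25),
        Attempt("t2", "big", True, 2.00), Attempt("t2", "small", False, 0.50),
        Attempt("t3", "big", True, 0.50), Attempt("t3", "small", True, 0.25),
    ]
    print(compare(runs, expensive="big", cheap="small"))
```

Running the same summary on this month's and last month's results for a single model would give a rough signal for the regression question, though a real check would need enough tasks to separate a genuine drop from run-to-run noise.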

Right now the project is a bit medium-rare, but it works end-to-end. I’ve run it successfully against itself, and I’m waiting for my token limits to reset so I can add support for more languages and do a broader run. I'm already seeing a few cases where I could've used 5.4-mini instead of 5.4 for some parts of the implementation.

I’d love any feedback, criticism, and ideas. I am especially interested in whether this is something you might pay for as a managed service, or whether you would contribute your private test cases to a shared, commons-style hold-out set to hold AI providers a bit more accountable.

https://repogauge.org [email protected] https://github.com/s1liconcow/repogauge

Thanks! David

Similar Projects

AI/ML · ●● Solid

Ctx, save tokens by loading only the relevant tools

Pre-session tool selection via 102K-node graph beats inline token compression.

Big Brain · Niche Gem
stevesolun
821d ago