GitHub Repository

Smart(er) code reading for humans and AI agents. Reduces cost per correct answer by ~40% on average. Install: cargo install tilth -or- npx tilth

240 stars · Rust

Tilth v0.4.1 – 29% cheaper Sonnet, 22% on Opus (benchmark: 114 runs)

by jahala·Feb 17, 2026·2 points·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Instruction tuning on tool descriptions alone moved Sonnet's cost per correct answer from -17% to -29% versus baseline, with no code changes.

Strengths
  • Structural search (tree-sitter) vs text grep avoids false positives in symbol matching
  • Transitive callee expansion and bloom-filter-based dedup save tokens in multi-turn agent sessions
  • Rigorous benchmark methodology (cost per correct answer across real repos) proves ROI vs baseline
Weaknesses
  • Limited to 4 codebases in benchmark; broader validation needed
  • Haiku adoption stays at 42% despite tuning, suggesting smaller models can't leverage ranking
Target Audience

AI engineers, backend developers, devops teams using Claude for code analysis

Similar To

Sourcegraph Cody · Continue.dev · Cursor

Post Description

Smart code reading for humans and AI agents. Tilth is what happens when you give ripgrep, tree-sitter, and cat a shared brain.

--

v0.4.0 added search ranking, sibling surfacing, transitive callees, cognitive load stripping, smart truncation, and bloom filters. Got -17% on Sonnet, -20% on Opus.

v0.4.1 was pure instruction tuning with zero code changes. That alone moved Sonnet adoption from 89% to 98% and cost per correct answer from -17% to -29%.
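For anyone unfamiliar with the metric, here is a minimal sketch of how cost per correct answer and its relative change could be computed. It assumes the metric is total spend divided by the number of correct answers. The function names and every number below are hypothetical and not taken from the tilth benchmark.

```python
def cost_per_correct(total_cost_usd: float, correct_answers: int) -> float:
    """Total spend divided by the number of correct answers (assumed definition)."""
    if correct_answers <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / correct_answers


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change of candidate versus baseline (negative = cheaper)."""
    return (candidate - baseline) / baseline


# Hypothetical numbers, for illustration only.
baseline = cost_per_correct(total_cost_usd=10.00, correct_answers=20)
with_tool = cost_per_correct(total_cost_usd=7.10, correct_answers=20)
print(f"{relative_change(baseline, with_tool):+.0%}")  # prints -29%
```

Under this assumed definition, runs that end in a wrong answer still add to the numerator but not the denominator, so a tool can improve the number either by spending fewer tokens or by getting more answers right.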

The instruction tuning result surprised me. The model already knew tilth tools existed — it just wasn’t choosing them consistently. Making the replacement relationship explicit in the tool description was worth more than all the search ranking work in v0.4.0.
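As an illustration of what "making the replacement relationship explicit" might look like, here is a sketch of two descriptions for the same tool. The tool name `tilth_search`, its `query` parameter, and both description strings are hypothetical and not copied from tilth. The dict shape (name, description, input_schema) is just a common layout for tool definitions.

```python
# Hypothetical tool definitions, for illustration only; not tilth's actual text.
SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

# Before: says what the tool does, but not when to prefer it.
vague_tool = {
    "name": "tilth_search",
    "description": "Search code structurally using tree-sitter.",
    "input_schema": SCHEMA,
}

# After: names the built-in habits this tool is meant to replace.
explicit_tool = {
    "name": "tilth_search",
    "description": (
        "Search code structurally using tree-sitter. "
        "Use this instead of grep or reading whole files "
        "when looking up symbols or definitions."
    ),
    "input_schema": SCHEMA,
}
```

Only the description string differs between the two, which matches the "zero code changes" framing: the tool behaves the same, and the model is simply told when to reach for it.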

Haiku remains the outlier — only 42% tilth adoption despite instruction tuning.

--

https://github.com/jahala/tilth/

Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...

-- PS: I don't have the budget to run the benchmark often (especially with Opus), so if any token whales have capacity to run some benchmarks, please feel free to PR results.

Similar Projects