GitHub Repository

Smart(er) code reading for humans and AI agents. Reduces cost per correct answer by ~40% on average. Install: cargo install tilth -or- npx tilth

308 stars · Rust

Tilth v0.4.1 – 29% cheaper Sonnet, 22% on Opus (benchmark: 114 runs)

Author: jahala

by jahala·Feb 17, 2026·2 points·0 comments


AI Analysis

●●● Banger · Big Brain · Solve My Problem

Instruction tuning on the tool descriptions, with no code changes, took the Sonnet cost reduction from 17% to 29%.

Strengths

• Structural search (tree-sitter) instead of text grep avoids false positives in symbol matching
• Transitive callee expansion and bloom-filter-based dedup save tokens in multi-turn agent sessions
• Rigorous benchmark methodology (cost per correct answer across real repos) shows ROI against the baseline

Weaknesses

• Limited to 4 codebases in benchmark; broader validation needed
• Haiku adoption stays at 42% despite tuning, suggesting smaller models can't leverage ranking

Post Description

Smart code reading for humans and AI agents. Tilth is what happens when you give ripgrep, tree-sitter, and cat a shared brain.

v0.4.0 added search ranking, sibling surfacing, transitive callees, cognitive load stripping, smart truncation, and bloom filters. That release cut cost per correct answer by 17% on Sonnet and 20% on Opus.
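To make the bloom filter idea concrete, here is a minimal sketch of how a session could skip snippets it has probably shown already. This is not tilth's implementation: `BloomFilter`, `check_and_insert`, and `dedup_snippets` are hypothetical names, and the sizing and the double-hashing scheme are assumptions chosen for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A small fixed-size Bloom filter (illustrative only, not tilth's code).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    /// `num_bits` is rounded up to at least 64; `num_hashes` to at least 1.
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        BloomFilter {
            bits: vec![0; words],
            num_bits,
            num_hashes: num_hashes.max(1),
        }
    }

    /// Two base hashes; the second uses a salt so it differs from the first.
    fn hash_pair<T: Hash>(item: &T) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        item.hash(&mut a);
        let mut b = DefaultHasher::new();
        0xA5u8.hash(&mut b);
        item.hash(&mut b);
        // Force the stride to be odd so it is never zero.
        (a.finish(), b.finish() | 1)
    }

    /// Marks `item` as seen. Returns true if it was *possibly* seen before
    /// (false positives can happen, false negatives cannot).
    pub fn check_and_insert<T: Hash>(&mut self, item: &T) -> bool {
        let (h1, h2) = Self::hash_pair(item);
        let mut seen = true;
        for i in 0..self.num_hashes {
            // Double hashing: the i-th probe is h1 + i * h2.
            let bit = h1.wrapping_add(u64::from(i).wrapping_mul(h2)) % self.num_bits;
            let word = (bit / 64) as usize;
            let mask = 1u64 << (bit % 64);
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}

/// Keeps only the snippets that were (probably) not shown earlier.
pub fn dedup_snippets<'a>(seen: &mut BloomFilter, snippets: &[&'a str]) -> Vec<&'a str> {
    snippets
        .iter()
        .copied()
        .filter(|s| !seen.check_and_insert(s))
        .collect()
}
```

A caller would keep one `BloomFilter` per agent session and pass every batch of results through `dedup_snippets` before returning it. The trade-off is that a rare false positive drops a snippet the model has not actually seen, in exchange for constant memory regardless of session length.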

v0.4.1 was pure instruction tuning with zero code changes. That alone raised Sonnet's tilth adoption from 89% to 98% and moved the change in cost per correct answer from -17% to -29%.
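For readers unfamiliar with the metric, the sketch below shows one plausible way to compute cost per correct answer and its change against a baseline. It assumes the metric is total spend divided by the number of correct answers; the benchmark's exact definition lives in the linked results, and `RunSummary`, `cost_per_correct`, and `relative_change` are hypothetical names.

```rust
/// Aggregate result of one benchmark configuration (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct RunSummary {
    /// Total spend across all runs, in dollars.
    pub total_cost_usd: f64,
    /// Number of runs whose answer was judged correct.
    pub correct_answers: u32,
}

/// Dollars spent per correct answer; `None` when nothing was correct.
pub fn cost_per_correct(s: RunSummary) -> Option<f64> {
    if s.correct_answers == 0 {
        None
    } else {
        Some(s.total_cost_usd / f64::from(s.correct_answers))
    }
}

/// Relative change of `candidate` versus `baseline`, as a fraction.
/// A negative value means the candidate is cheaper (-0.29 reads as "-29%").
pub fn relative_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline
}
```

With made-up numbers, a baseline that spends $10.00 for 20 correct answers costs $0.50 per correct answer; a run that spends $7.10 for the same 20 correct answers costs $0.355, a relative change of -0.29. Dividing by correct answers rather than by total runs means a tool that is cheap but wrong more often does not look like a win.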

The instruction tuning result surprised me. The model already knew tilth tools existed — it just wasn’t choosing them consistently. Making the replacement relationship explicit in the tool description was worth more than all the search ranking work in v0.4.0.

Haiku remains the outlier — only 42% tilth adoption despite instruction tuning.

https://github.com/jahala/tilth/

Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...

-- PS: I don't have the budget to run the benchmark often (especially with Opus), so if any token whales have capacity to run some benchmarks, please feel free to PR results.