Llama 3.2 3B and Keiro Research achieve 85% on SimpleQA

Author: mannybruv

by mannybruv·Mar 7, 2026·6 points·1 comment

AI Analysis

●●● Banger · Big Brain · Ship It

Retrieval-aware inference with a 3B model lands within about 3 points of a 671B system, suggesting that context matters more than scale for this kind of task.

Strengths

•Replaces expensive parameter scaling with a focused retrieval loop, and ships the benchmark script and results.
•Clear benchmark: 85.0% with a 3B model vs. Sonar Pro's 85.8% and 88.3% for the 671B OpenDeepSearch.
•Economics shift: $0.005 per query commoditizes agent reasoning for anyone with a laptop.

Weaknesses

•SimpleQA is a narrow benchmark; generalization to complex reasoning tasks is unclear.
•Keiro is a closed API dependency, not a pure open-source win.

Post Description

ran this over the weekend. stack was Llama 3.2 3B running locally + Keiro Research API for retrieval.
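A minimal sketch of what a retrieve-then-read loop like this can look like, not the actual benchmark code. The `retrieve` and `read` callables are hypothetical placeholders for the retrieval API call and the local model call; their signatures, the prompt wording, and `max_snippets` are assumptions made for illustration and are not taken from the post or from Keiro's docs.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not the real Keiro or Llama APIs).
Retriever = Callable[[str], List[str]]  # question -> evidence snippets
Reader = Callable[[str], str]           # prompt -> model completion


def build_prompt(question: str, snippets: List[str], max_snippets: int = 5) -> str:
    """Put numbered evidence ahead of the question so a small model can ground its answer."""
    evidence = "\n".join(
        f"[{i}] {s}" for i, s in enumerate(snippets[:max_snippets], start=1)
    )
    return (
        "Answer the question using only the evidence below. "
        "Reply with a short answer.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str, retrieve: Retriever, read: Reader) -> str:
    """Retrieve first, then let the reader model answer from the retrieved context."""
    snippets = retrieve(question)
    return read(build_prompt(question, snippets)).strip()


if __name__ == "__main__":
    # Stubs stand in for the retrieval API and the local 3B model.
    def fake_retrieve(q: str) -> List[str]:
        return ["Canberra is the capital city of Australia."]

    def fake_read(prompt: str) -> str:
        return " Canberra "

    print(answer("What is the capital of Australia?", fake_retrieve, fake_read))
```

Keeping the retriever and reader as plain callables means either side can be swapped (a different search backend, a smaller or larger reader) without touching the loop.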

85.0% on 4,326 questions. where that lands:

ROMA (357B): 93.9%
OpenDeepSearch (671B): 88.3%
Sonar Pro: 85.8%
Llama 3.2 3B + Keiro: 85.0%

the systems ahead of us are running models 100-200x larger. that's why they're ahead. not better retrieval, not better prompting — just way more parameters.

the interesting part is how small the gap is despite that. 3.3 points behind a 671B model. 0.8 behind Sonar Pro. at some point you have to ask what you're actually buying with all that compute for this class of task.
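The gaps and size ratios can be checked with a few lines of arithmetic. This sketch uses only the scores and parameter counts quoted in the post; treating the reader as a nominal 3B and treating the 4,326 questions as independent samples (for a rough margin of error) are both simplifying assumptions.

```python
import math

# Scores (percent) and parameter counts (billions) as quoted in the post.
# Sonar Pro's size is not given there, so it is left as None.
SYSTEMS = {
    "ROMA": (93.9, 357),
    "OpenDeepSearch": (88.3, 671),
    "Sonar Pro": (85.8, None),
    "Llama 3.2 3B + Keiro": (85.0, 3),  # nominal 3B (assumption)
}
N_QUESTIONS = 4326
BASELINE = "Llama 3.2 3B + Keiro"


def gap_to(name: str) -> float:
    """Accuracy gap in percentage points between `name` and the 3B setup."""
    return round(SYSTEMS[name][0] - SYSTEMS[BASELINE][0], 1)


def size_ratio(name: str) -> float:
    """How many times larger `name` is than the 3B reader (needs a known size)."""
    return SYSTEMS[name][1] / SYSTEMS[BASELINE][1]


def margin_95(accuracy_pct: float, n: int) -> float:
    """Approximate 95% margin of error in points, assuming independent questions."""
    p = accuracy_pct / 100
    return 1.96 * math.sqrt(p * (1 - p) / n) * 100


if __name__ == "__main__":
    for name in ("ROMA", "OpenDeepSearch", "Sonar Pro"):
        print(name, "gap:", gap_to(name))
    print("OpenDeepSearch size ratio:", round(size_ratio("OpenDeepSearch")))
    print("95% margin:", round(margin_95(85.0, N_QUESTIONS), 2))
```

Under those assumptions the margin comes out to roughly one percentage point, so the 0.8-point gap to Sonar Pro sits inside the sampling noise of a single run, while the nominal size ratios land at about 119x and 224x.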

want to know how small the reader model can go before it starts mattering. in this setup it clearly wasn't the limiting factor. also curious whether smaller models with web access will perform as well as (if not better than) larger models for a lot of non-coding tasks.

Full benchmark script + results --> https://github.com/h-a-r-s-h-s-r-a-h/benchmark

Keiro research -- https://www.keirolabs.cloud/docs/api-reference/research