
Production-grade RAG API built in Rust. Hybrid search with HNSW dense vectors and BM25 sparse matching, cross-encoder reranking, layout-aware document extraction via Docling, and 94.5% accuracy on Open RAG Bench. Powered by Cerebras, Groq, Milvus, and Jina AI.
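The description says RustyRAG combines HNSW dense-vector search with BM25 sparse matching and then reranks with a cross-encoder. It does not say how the two result lists are merged. Reciprocal Rank Fusion (RRF) is one common way to do it. The sketch below assumes RRF with the frequently used constant k = 60 and keeps a small top-n. The function name and chunk IDs are hypothetical. In a full pipeline the fused candidates would then go to the cross-encoder instead of being returned directly.

```rust
use std::collections::HashMap;

/// Fuses several best-first ranked lists of chunk IDs with Reciprocal Rank Fusion:
/// score(d) = sum over lists of 1 / (k + rank(d)), where rank is 1-based.
/// (Hypothetical helper; RustyRAG's actual fusion method is not stated.)
fn reciprocal_rank_fusion(rankings: &[Vec<&str>], k: f64, top_n: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by ID so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused.truncate(top_n);
    fused
}

fn main() {
    // Invented example results, each ordered best-first.
    let dense = vec!["c7", "c2", "c9", "c4"]; // e.g. HNSW nearest neighbours
    let sparse = vec!["c2", "c5", "c7", "c1"]; // e.g. BM25 matches
    for (id, score) in &reciprocal_rank_fusion(&[dense, sparse], 60.0, 3) {
        println!("{id}: {score:.5}");
    }
}
```

Because RRF uses only ranks, it avoids having to calibrate cosine similarities against BM25 scores, which live on different scales.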

195 stars · Rust

RustyRAG: lowest-latency open-source RAG on GitHub

by zer0tokens · Mar 4, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Wizardry · Dark Horse

Sub-600ms RAG across continents, with no GPU, beats standard vector-DB-plus-LLM stacks.

Strengths
  • Genuine latency wins, not marketing hyperbole: sub-200ms on localhost and sub-600ms intercontinental without a GPU; the benchmarks use real PDFs and chunks and are reproducible
  • Local embeddings (Jina v5-text-nano-retrieval, replacing Cohere) remove an embedding-API round-trip per query, and LLM inference moved to Cerebras/Groq: architectural discipline, not just tuning
  • Open-source Rust codebase with Docker and Swagger makes it a plug-and-play replacement for heavier RAG stacks (LangChain, LlamaIndex)
Weaknesses
  • The latency advantage depends entirely on tier-1 inference from Groq/Cerebras; there is no fallback strategy if those APIs degrade or become unavailable
  • The 977-PDF, 56K-chunk corpus is small; behavior on indices with millions of documents and chunks is unverified, and while local embeddings in Milvus work for small retrieval sets, recall at scale is unclear
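One way to address the missing fallback is to try inference providers in a fixed order and return the first successful answer. This is a hypothetical sketch, not something the listing says RustyRAG does. `Provider` and `complete_with_fallback` are illustrative names, and the provider calls are stubbed with plain functions so the example runs offline. A slow or degraded provider only triggers the fallback if the underlying HTTP client turns it into an error, for example through a request timeout.

```rust
/// A named completion backend (hypothetical). In a real service `call` would wrap
/// an HTTP client configured with a request timeout.
struct Provider {
    name: &'static str,
    call: Box<dyn Fn(&str) -> Result<String, String>>,
}

/// Tries each provider in order; returns the first successful answer together
/// with the provider's name, or every error message if all of them fail.
fn complete_with_fallback(providers: &[Provider], prompt: &str) -> Result<(&'static str, String), Vec<String>> {
    let mut errors = Vec::new();
    for p in providers {
        match (p.call)(prompt) {
            Ok(answer) => return Ok((p.name, answer)),
            Err(e) => errors.push(format!("{}: {}", p.name, e)),
        }
    }
    Err(errors)
}

/// Stub for a provider that is currently down.
fn primary_down(_prompt: &str) -> Result<String, String> {
    Err("HTTP 503".to_string())
}

/// Stub for a healthy provider.
fn secondary_ok(prompt: &str) -> Result<String, String> {
    Ok(format!("answer to: {prompt}"))
}

fn main() {
    let providers = vec![
        Provider { name: "primary", call: Box::new(primary_down) },
        Provider { name: "secondary", call: Box::new(secondary_ok) },
    ];
    match complete_with_fallback(&providers, "What is HNSW?") {
        Ok((name, answer)) => println!("{name} -> {answer}"),
        Err(errors) => eprintln!("all providers failed: {errors:?}"),
    }
}
```

Returning the provider name alongside the answer makes it easy to log how often the fallback path is taken, which is the signal that the primary provider is degrading.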
Target Audience

Engineers building RAG applications who prioritize latency and cost over accuracy; teams without GPU infrastructure

Similar To

LlamaIndex · LangChain · Haystack

Post Description

I built an open-source RAG API in Rust to see how low I could push latency without a GPU.

RustyRAG v0.2 hits sub-200ms on localhost and sub-600ms from Azure North Central US to a browser in Brazil, on a corpus of 977 PDFs (56K chunks in Milvus) with 3 sources per response.
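The post states latency as upper bounds (sub-200ms, sub-600ms) without naming the statistic. When reproducing numbers like these, reporting percentiles over many requests says more than a single run. Below is a minimal nearest-rank percentile sketch; `percentile_ms` is a hypothetical helper, and the sample values are invented for illustration, not RustyRAG measurements.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. Requires 0 < p <= 100.
fn percentile_ms(samples: &[f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && p > 0.0 && p <= 100.0);
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize; // 1-based rank
    sorted[rank - 1]
}

fn main() {
    // Invented end-to-end latencies in milliseconds.
    let samples = [182.0, 175.0, 640.0, 190.0, 168.0, 201.0, 177.0, 185.0, 172.0, 195.0];
    println!(
        "p50 = {} ms, p95 = {} ms",
        percentile_ms(&samples, 50.0),
        percentile_ms(&samples, 95.0)
    );
}
```

With these samples p50 is 182 ms while p95 is 640 ms: a single slow request is enough to push the tail past a sub-600ms target that the median meets easily.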

Key changes in v0.2: switched to Cerebras/Groq for LLM inference, replaced Cohere with Jina AI local embeddings (v5-text-nano-retrieval), and added optional contextual retrieval via LLM-generated chunk prefixes.
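Contextual retrieval, as described here, prepends an LLM-generated prefix to each chunk. The listing does not say how RustyRAG prompts the LLM or stores the result. The sketch below assumes the prefix is added to the text that gets embedded and BM25-indexed, while the original chunk is kept unchanged for citations. `IndexedChunk`, `contextualize`, and `generate_context` are hypothetical names, and the LLM call is replaced by a stub closure.

```rust
/// A chunk as it might be stored for indexing (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct IndexedChunk {
    /// Text that is embedded and BM25-indexed: generated prefix + original chunk.
    index_text: String,
    /// Original chunk text, kept as-is for citing sources in the answer.
    original: String,
}

/// Prepends a generated context string to each chunk. `generate_context` stands in
/// for an LLM call that receives the whole document and one chunk.
fn contextualize<F>(document: &str, chunks: &[&str], generate_context: F) -> Vec<IndexedChunk>
where
    F: Fn(&str, &str) -> String,
{
    chunks
        .iter()
        .map(|&chunk| {
            let prefix = generate_context(document, chunk);
            IndexedChunk {
                index_text: format!("{}\n\n{}", prefix.trim(), chunk),
                original: chunk.to_string(),
            }
        })
        .collect()
}

fn main() {
    let doc = "ACME 2024 annual report. Revenue grew 12% year over year.";
    let chunks = ["Revenue grew 12% year over year."];
    // Stub in place of an LLM: here it just reuses the document's first sentence.
    let stub = |d: &str, _c: &str| format!("From: {}", d.split('.').next().unwrap_or(""));
    for c in contextualize(doc, &chunks, stub) {
        println!("{}", c.index_text);
    }
}
```

In this sketch the prefixes are computed once at ingestion, so query-time latency is unaffected; the cost is one LLM call per chunk, which for the 56K-chunk corpus above would mean 56K calls.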

Similar Projects

Data · ●●● Banger

LaminarDB – Streaming SQL database in Rust, zero-alloc hot path

Sub-microsecond streaming SQL via a zero-alloc hot path; a genuine advancement over SQLite+DataFusion.

Wizardry · Niche Gem · Solve My Problem
sujitn