GitHub Repository

Python library to search over the Epstein Files. AI-powered vector search across unsealed court documents, FBI reports, and flight logs. Runs entirely locally or with API.

6 starsPython

Epstein-Search – Local, AI-Powered Search Engine for the Epstein Files

Name: Epstein-Search – Local, AI-Powered Search Engine for the Epstein Files
Availability: InStock
Author: simulationship

by simulationship·Mar 1, 2026·1 point·0 comments

Visit Project View on HN

AI Analysis

●●SolidBig BrainNiche Gem

Offline RAG over Epstein Files with sentence-transformers and local LLM fallback.

Strengths

•Entirely offline vector search (no API keys, no data leakage) is genuinely privacy-respecting for sensitive documents
•Pre-computed embeddings (all-MiniLM-L6-v2) means setup is one minute, not hours of indexing
•LiteLLM + Ollama/LM Studio integration lets users choose local or cloud LLMs on the fly

Weaknesses

•100K pre-computed chunks may be stale if source documents are updated; no refresh mechanism documented
•Niche corpus (one case) limits utility; not a general-purpose RAG framework

Post Description

Hi HN, I built epstein-search, an open-source Python CLI and library to run semantic search and RAG over the publicly released Epstein Files (unsealed court documents, depositions, FBI reports, and flight logs). I wanted a way to easily navigate through these thousands of pages of unstructured legal PDFs without relying on a paid third-party service or sending data back and forth to a cloud provider. How it works under the hood: Running epstein-search setup downloads ~100K pre-computed document chunks and embeddings (using all-MiniLM-L6-v2) based on the public 20K document corpus. It imports these into zvec (a local vector database) so the index is ready in about a minute. Standard search (epstein-search search) embeds your query locally using sentence-transformers and does a vector similarity search. This step is 100% offline and requires no API keys. For the conversational RAG mode (epstein-search chat or ask), it uses LiteLLM. You can point it to an Ollama or LM Studio instance for a completely free, local, and private pipeline, or plug in a cloud provider like Anthropic, OpenAI, or Gemini. You can also filter queries by document type (e.g., --doc-type flight_log or --source "FBI") and output the raw source context alongside the generated answers to verify the LLM's claims. The dataset is strictly sourced from public domain releases (DOJ, House Oversight Committee, unsealed federal court docs). Repo: https://github.com/simulationship/epstein-search I'd love to hear your thoughts, feedback on the code, or any ideas for improving the local RAG pipeline! Happy to answer any questions.