Zero-config entity resolution that scales from a CSV to 100M+ rows on a Ray cluster (verified: 100M deduped in 213s, 0.30 GB driver). Fuzzy + exact + probabilistic dedupe, identity graph, PPRL, LLM boost. Python + full TypeScript port; SQL-native in PostgreSQL & DuckDB; MCP/REST servers, dbt + Airflow recipes.

97 stars · Python

GoldenMatch – Entity resolution with LLM scoring, 97% F1, no Spark

by benzsevern·Mar 21, 2026·3 points·0 comments

AI Analysis

●● Solid · Big Brain · Solve My Problem

Fellegi-Sunter matching with active learning beats Dedupe.io on complex datasets.
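
For readers unfamiliar with Fellegi-Sunter scoring, here is a minimal Python sketch of the idea. The field names, the m and u probabilities, and the prior are hypothetical values chosen for illustration. This is not GoldenMatch's API, and in practice such probabilities would be estimated from data (for example by EM) rather than hard-coded.

```python
import math

# Hypothetical per-field parameters (illustrative only, not from GoldenMatch):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
PARAMS = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "zip": (0.85, 0.10),
}


def match_weight(agreements: dict) -> float:
    """Sum the log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = PARAMS[field]
        if agrees:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


def match_probability(weight: float, prior: float = 0.01) -> float:
    """Turn a match weight into a posterior probability.

    `prior` is an assumed share of candidate pairs that are true matches.
    """
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** weight
    return odds / (1 + odds)


if __name__ == "__main__":
    pair = {"name": True, "email": True, "zip": False}
    w = match_weight(pair)
    print(f"weight={w:.2f} probability={match_probability(w):.4f}")
```

Pairs whose probability falls near a chosen threshold are the borderline cases a human reviewer would typically be asked to label.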

Strengths
  • Fellegi-Sunter EM-trained probabilities with automatic threshold estimation built in.
  • Active learning TUI: label 10 borderline pairs and the classifier retrains instantly.
  • Privacy-preserving Bloom filter transforms for fuzzy matching on encrypted PII.
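
The Bloom filter approach in the last bullet can be sketched roughly as follows. This is a generic illustration of Bloom-filter-based privacy-preserving record linkage, not GoldenMatch's implementation: the function names, filter size, hash count, and shared secret are all assumptions. Note that it uses keyed hashing (HMAC) of character bigrams rather than encryption in the strict sense.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Split a normalized, padded string into character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes, size: int = 1024, num_hashes: int = 4) -> set:
    """Return the set of bit positions that the value's bigrams switch on.

    Both parties must use the same `secret`, `size`, and `num_hashes`
    (all hypothetical parameters here) for encodings to be comparable.
    """
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # A keyed hash, so positions cannot be recomputed without the secret.
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(a: set, b: set) -> float:
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret-for-illustration"
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    jones = bloom_encode("Jones", key)
    print(dice_similarity(smith, smyth), dice_similarity(smith, jones))
```

Because similar strings share most of their bigrams, their encodings overlap heavily, so a fuzzy comparison is possible without either party exchanging the raw values.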
Weaknesses
  • Entity resolution is crowded: Dedupe.io, OpenRefine, and commercial tools already exist.
  • LLM scoring and Vertex AI embeddings require paid API keys for best accuracy.
Category
Target Audience

Data engineers and analysts working with messy duplicate records

Similar To

Dedupe.io · OpenRefine · Tamr

Similar Projects

Data · ●● Solid

GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native

Ray-based dedupe at 100M rows without Spark — that's a real architectural choice.

Big Brain · Ship It
benzsevern
309d ago
Developer Tools · ●● Solid

Treliq – PR triage CLI with 20 signals and optional LLM scoring

Deduping PRs and scoring them with 20 heuristic signals is a concrete, useful idea, especially the scope-coherence signal and the embedding auto-fallback for providers without embeddings. The repo supports a CLI, a persistent server, GitHub App integration, and an explicit --model flag for provider flexibility. It is still early, though: adoption and UX examples (ranked output, workflows) are thin, so this is promising engineering scaffolding that needs real-world validation.

Niche Gem · Solve My Problem
chrismagno
103mo ago
AI/ML · ●● Solid

Entroly – Compress codebase context for LLMs by 78% using Rust

Entropy-based context compression beats naive token stuffing, but the category is crowded.

Big Brain · Niche Gem
savetokens
102mo ago