GitHub Repository

Zero-config entity resolution that scales from a CSV to 100M+ rows on a Ray cluster (verified: 100M deduped in 213s, 0.30 GB driver). Fuzzy + exact + probabilistic dedupe, identity graph, PPRL, LLM boost. Python + full TypeScript port; SQL-native in PostgreSQL & DuckDB; MCP/REST servers, dbt + Airflow recipes.

108 stars · Python

GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native

by benzsevern·Jun 4, 2026·3 points·0 comments

AI Analysis

●● Solid · Big Brain · Ship It

Ray-based dedupe at 100M rows without Spark — that's a real architectural choice.

Strengths
  • 0.30 GB driver footprint while processing 100M records is genuinely impressive.
  • Polyglot support across Python, TypeScript, PostgreSQL, and DuckDB.
  • MCP servers and dbt recipes show production integration thinking.
Weaknesses
  • Entity resolution already has established players like Splink and Dedupe.io.
  • 74 stars suggests early adoption — real-world scale validation still pending.
Category
Target Audience

Data engineers, data scientists

Similar To

Splink · Dedupe.io · OpenRefine

Similar Projects

Data · ●● Solid

GoldenMatch – Entity resolution with LLM scoring, 97% F1, no Spark

Fellegi-Sunter matching with active learning beats Dedupe.io on complex datasets.

Big Brain · Solve My Problem
benzsevern
30 · 2mo ago
AI/ML · Mid

Small Model Marketplace W 100M Tokens Free

100M free tokens is generous, but Hugging Face and Replicate already host models.

Niche Gem
robmay
10 · 2mo ago