Fastdedup – Rust dataset deduplication (2:55 vs. 7:55, 688MB vs. 22GB)


by wapplewhite4 · Feb 24, 2026 · 1 point · 0 comments


AI Analysis

●●● Banger · Wizardry · Big Brain · Solve My Problem

Bloom filter + AHash pipeline cuts exact dedup from 7:55 to 2:55, 688MB vs 21.9GB RAM.

Strengths

•Genuine constraint solving: single-machine dedup for people who don't have distributed clusters
•Character n-grams instead of spaCy tokenization give a fuzzy-dedup speedup of at least 6x over datatrove (36:44 vs. a datatrove run stopped after 3h50m in stage 1)
•Honest benchmarking against reference implementations with identical duplicate counts

Weaknesses

•Fuzzy dedup requires 23GB RAM at the benchmarked scale (14.8M records), so it is a cloud workload rather than a laptop one
•Exact dedup ran on a single core, so it does not use the extra cores on modern CPUs

Post Description

I've been working on a Rust CLI for dataset deduplication and wanted to share benchmark results. Ran on FineWeb sample-10BT (14.8M records, 29GB) on a single machine.

Exact dedup vs DuckDB + SHA-256:

•2:55 vs 7:55 wall clock (2.7x faster)
•688MB vs 21.9GB peak RAM (32x less)
•Single core vs 4+ cores
•Duplicate counts match exactly (51,392 both ways)
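The summary above mentions a Bloom filter + AHash pipeline. For readers unfamiliar with the idea, here is a minimal sketch of Bloom-filter-based dedup in a single streaming pass. It is not fastdedup's code: the `Bloom` type, `hash_pair`, `dedup`, and the sizing parameters are hypothetical, and std's `DefaultHasher` stands in for AHash so the example needs no external crates. A plain Bloom filter can report false positives, so a tool that needs exact duplicate counts has to size the filter generously or confirm hits some other way; the post doesn't say how fastdedup handles that.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size Bloom filter over records (hypothetical illustration type).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl Bloom {
    fn new(num_bits: u64, num_hashes: u32) -> Self {
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Sets the record's probe bits. Returns true if every bit was already
    /// set, meaning the record was probably seen before.
    fn check_and_insert(&mut self, record: &str) -> bool {
        let (h1, h2) = hash_pair(record);
        let mut seen = true;
        for i in 0..self.num_hashes as u64 {
            // Double hashing: derive the i-th probe from two base hashes.
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}

/// Two base hashes per record. DefaultHasher is a stand-in for AHash here.
fn hash_pair(record: &str) -> (u64, u64) {
    let mut a = DefaultHasher::new();
    record.hash(&mut a);
    let mut b = DefaultHasher::new();
    (record, 0x9e37u16).hash(&mut b);
    // Force the step to be odd so successive probes never collapse to one bit.
    (a.finish(), b.finish() | 1)
}

/// Keeps the first occurrence of each record, in input order.
fn dedup<'a>(records: &[&'a str], num_bits: u64, num_hashes: u32) -> Vec<&'a str> {
    let mut bloom = Bloom::new(num_bits, num_hashes);
    records
        .iter()
        .copied()
        .filter(|r| !bloom.check_and_insert(r))
        .collect()
}
```

Calling `dedup(&["a", "b", "a"], 1 << 16, 4)` should keep the first `a` and `b` and drop the repeat. Memory stays fixed at the filter size no matter how many records stream through, which is what makes this kind of approach attractive on a single machine.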

Fuzzy dedup (MinHash + LSH) vs datatrove:

•36:44 vs 3h50m+: datatrove stage 1 alone ran for 3h50m and we killed it
•datatrove's bottleneck turned out to be spaCy word tokenization on every document before shingling. fastdedup uses character n-grams directly, which is significantly cheaper
•23GB vs 1.1GB RAM: this is a real trade-off, not a win. datatrove streams to disk; fastdedup holds the LSH index in memory for speed
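To make the tokenization point concrete, here is a rough sketch of character n-gram shingling feeding a MinHash signature. It is an illustration, not fastdedup's implementation: the function names, the n-gram length, the number of hash functions, and the seeded use of std's `DefaultHasher` are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hashes a value together with a seed, giving a family of hash functions.
fn hash_with_seed<T: Hash>(value: &T, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

/// Character n-gram shingles as a set of 64-bit hashes.
/// No word tokenizer is involved: the window slides over raw characters.
/// `n` must be at least 1; texts shorter than `n` yield an empty set.
fn char_shingles(text: &str, n: usize) -> HashSet<u64> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(n).map(|w| hash_with_seed(&w, 0)).collect()
}

/// MinHash signature: for each seeded hash function, the minimum hash
/// over all shingles. An empty shingle set yields u64::MAX in every slot.
fn minhash(shingles: &HashSet<u64>, num_perm: usize) -> Vec<u64> {
    (0..num_perm as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| hash_with_seed(s, seed + 1))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of matching signature slots, which estimates Jaccard similarity.
/// Both signatures are assumed to be non-empty and of equal length.
fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}
```

The per-document work is a sliding window plus hashing, with no language model or tokenizer load, which is consistent with the post's observation that skipping word tokenization is much cheaper.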

Honest caveats:

•Fuzzy dedup needs ~23GB RAM at this scale: a cloud workload, not a laptop workload
•datatrove is built for distributed execution, and tasks=1 isn't its intended config; this is how someone would run it locally
•Tiered storage to spill the LSH index to disk is on the roadmap
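An in-memory LSH index keeps an entry per document per band resident, which is why its footprint grows with the dataset. A simplified banded index might look like the following; the `LshIndex` type, the band and row counts, and the bucket layout are hypothetical, since the post doesn't describe fastdedup's index structure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-memory LSH index: one bucket map per band (hypothetical layout).
struct LshIndex {
    rows_per_band: usize,
    bands: Vec<HashMap<u64, Vec<usize>>>,
}

impl LshIndex {
    fn new(num_bands: usize, rows_per_band: usize) -> Self {
        LshIndex { rows_per_band, bands: vec![HashMap::new(); num_bands] }
    }

    /// Inserts a MinHash signature and returns the ids of previously
    /// inserted documents that share at least one band with it.
    /// These are candidates only; a real pipeline would verify similarity.
    fn insert(&mut self, doc_id: usize, signature: &[u64]) -> Vec<usize> {
        let rows_per_band = self.rows_per_band;
        assert_eq!(signature.len(), self.bands.len() * rows_per_band);
        let mut candidates = Vec::new();
        for (band, rows) in self.bands.iter_mut().zip(signature.chunks(rows_per_band)) {
            // Hash the band's rows into a single bucket key.
            let mut h = DefaultHasher::new();
            rows.hash(&mut h);
            let bucket = band.entry(h.finish()).or_default();
            candidates.extend(bucket.iter().copied());
            bucket.push(doc_id);
        }
        candidates.sort_unstable();
        candidates.dedup();
        candidates
    }
}
```

Every inserted document adds one id to a bucket in each band, so memory scales with documents times bands. Spilling those bucket maps to disk, as the roadmap item suggests, would trade lookup speed for a smaller resident set.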

Demo: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo
Repo: https://github.com/wapplewhite4/fastdedup

Happy to answer questions about implementation or methodology.