Fastdedup – Rust dataset deduplication (2:55 vs. 7:55 688MB vs. 22GB)

by wapplewhite4·Feb 24, 2026·1 point·0 comments

AI Analysis

Bloom filter + AHash pipeline runs exact dedup in 2:55 vs. 7:55 for DuckDB + SHA-256, with 688MB vs. 21.9GB peak RAM.

Strengths
  • Genuine constraint solving: single-machine dedup for people who don't have distributed clusters
  • Character n-grams instead of spaCy word tokenization make fuzzy dedup at least 6x faster than datatrove (36:44 vs. a 3h50m stage 1 that was stopped before finishing)
  • Honest benchmarking against reference implementations with identical duplicate counts
Weaknesses
  • Fuzzy dedup needs ~23GB RAM at this scale (14.8M records), so it is a cloud workload rather than a laptop one
  • Exact dedup runs on a single core, so it leaves additional cores unused
Target Audience

ML engineers, data scientists preparing LLM datasets

Similar To

DuckDB · datatrove · text-dedup

Post Description

I've been working on a Rust CLI for dataset deduplication and wanted to share benchmark results. Ran on FineWeb sample-10BT (14.8M records, 29GB) on a single machine.

Exact dedup vs DuckDB + SHA-256:

  • 2:55 vs 7:55 wall clock (2.7x faster)
  • 688MB vs 21.9GB peak RAM (32x less)
  • Single core vs 4+ cores
  • Duplicate counts match exactly (51,392 both ways)
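For readers wondering how exact dedup can stay well under 1GB on 14.8M records, here is a minimal sketch of streaming dedup with a Bloom filter. It is an illustration only, not fastdedup's actual code: the post does not describe the internals, the names `BloomFilter`, `check_and_insert`, and `dedup_stream` are hypothetical, and the standard library's `DefaultHasher` stands in for AHash so the example has no dependencies. A plain Bloom filter can report false positives, so a real tool would need some way to confirm hits; that part is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size Bloom filter over string records (hypothetical helper type).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        BloomFilter { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes per record; probe i uses h1 + i * h2 (double hashing).
    /// Assumption: DefaultHasher is used here in place of AHash.
    fn base_hashes(record: &str) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        record.hash(&mut a);
        let mut b = DefaultHasher::new();
        (record, 0x9E37_79B9_7F4A_7C15u64).hash(&mut b);
        (a.finish(), b.finish() | 1)
    }

    /// Sets the record's bits and returns true if all of them were already
    /// set, meaning the record was probably seen before.
    pub fn check_and_insert(&mut self, record: &str) -> bool {
        let (h1, h2) = Self::base_hashes(record);
        let mut seen = true;
        for i in 0..self.num_hashes as u64 {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}

/// Single pass over the records: keeps first occurrences, counts the rest.
pub fn dedup_stream<'a, I>(records: I, filter: &mut BloomFilter) -> (Vec<&'a str>, usize)
where
    I: IntoIterator<Item = &'a str>,
{
    let mut kept = Vec::new();
    let mut duplicates = 0;
    for record in records {
        if filter.check_and_insert(record) {
            duplicates += 1;
        } else {
            kept.push(record);
        }
    }
    (kept, duplicates)
}
```

Memory use is set by `num_bits` rather than by the size of the records, which is the general reason a filter-based approach can use far less RAM than storing every record or full hash.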

Fuzzy dedup (MinHash + LSH) vs datatrove:

  • 36:44 vs 3h50m+: datatrove stage 1 alone ran for 3h50m before we killed it
  • datatrove's bottleneck turned out to be spaCy word tokenization on every document before shingling. fastdedup uses character n-grams directly, which is significantly cheaper
  • 23GB vs 1.1GB RAM: this is a real trade-off, not a win. datatrove streams to disk; fastdedup holds the LSH index in memory for speed
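To make the fuzzy pipeline above concrete, here is a rough sketch of character n-gram shingling, MinHash signatures, and LSH banding using only the standard library. It is not taken from the fastdedup repo. All function names are hypothetical, seeded `DefaultHasher` calls stand in for whatever hash family the tool really uses, and the parameters in the usage note (5-character shingles, 64 hashes, 8 rows per band) are placeholder values, not fastdedup's settings.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Seeded hash; each seed acts as one hash function of the MinHash family.
fn hash_with_seed<T: Hash + ?Sized>(value: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Character n-gram shingles: a sliding window over chars, no word tokenizer.
pub fn char_shingles(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        // Text shorter than one window: use the whole text as a single shingle.
        return std::iter::once(text.to_string()).collect();
    }
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// MinHash signature: for each seed, the minimum hash over all shingles.
pub fn minhash_signature(shingles: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| hash_with_seed(s.as_str(), seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of matching signature slots, an estimate of Jaccard similarity.
pub fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len().max(1) as f64
}

/// LSH banding: documents sharing any (band, hash) bucket become candidates.
/// The bucket map lives entirely in memory in this sketch.
pub fn candidate_pairs(signatures: &[Vec<u64>], rows_per_band: usize) -> HashSet<(usize, usize)> {
    let mut buckets: HashMap<(usize, u64), Vec<usize>> = HashMap::new();
    for (doc, signature) in signatures.iter().enumerate() {
        for (band, rows) in signature.chunks(rows_per_band.max(1)).enumerate() {
            let key = (band, hash_with_seed(rows, band as u64));
            buckets.entry(key).or_default().push(doc);
        }
    }
    let mut pairs = HashSet::new();
    for docs in buckets.values() {
        for (i, &a) in docs.iter().enumerate() {
            for &b in &docs[i + 1..] {
                pairs.insert((a, b));
            }
        }
    }
    pairs
}
```

A caller would compute `minhash_signature(&char_shingles(text, 5), 64)` per document, pass the signatures to `candidate_pairs(&signatures, 8)`, and then check each candidate pair with `estimated_jaccard` against a chosen threshold. Keeping the bucket map in RAM, as this sketch does, is the kind of design that trades memory for speed.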

Honest caveats:

  • Fuzzy dedup needs ~23GB RAM at this scale, so it is a cloud workload, not a laptop workload
  • datatrove is built for distributed execution, and tasks=1 isn't its intended config, but this is how someone would run it locally
  • Tiered storage to spill the LSH index to disk is on the roadmap

Demo: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo
Repo: https://github.com/wapplewhite4/fastdedup

Happy to answer questions about implementation or methodology.
