MessyData – Synthetic dirty data generator

Author: santiviquez

by santiviquez · Mar 9, 2026 · 1 point · 0 comments


AI Analysis

● Mid · Solve My Problem · Cozy

Claude Code skill integration is nice, but Faker already generates dirty data.

Strengths

• YAML schema with lognormal distributions produces realistic data patterns
• Claude Code skill lets agents write configs and validate without manual setup
• Date-range generation with rows-per-day supports temporal testing scenarios
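
A minimal sketch of what the first and third strengths describe, written with Python's standard library only. It is not MessyData's API or YAML schema: the function name `generate_rows` and the parameters `mu`, `sigma`, and `null_rate` are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta


def generate_rows(start, end, rows_per_day, mu=3.0, sigma=0.8,
                  null_rate=0.1, seed=0):
    """Return (iso_date, amount) rows for every day in [start, end].

    Hypothetical helper, not part of MessyData. Amounts follow a lognormal
    distribution (right-skewed, always positive); a fraction of them is
    replaced with None to mimic missing values.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows = []
    day = start
    while day <= end:
        for _ in range(rows_per_day):
            amount = round(rng.lognormvariate(mu, sigma), 2)
            if rng.random() < null_rate:
                amount = None  # injected "dirty" value
            rows.append((day.isoformat(), amount))
        day += timedelta(days=1)
    return rows


rows = generate_rows(date(2026, 1, 1), date(2026, 1, 7), rows_per_day=5)
```

Seven days at five rows per day gives 35 rows, each with either a positive amount or a missing one, which is the kind of input a temporal pipeline test would consume.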

Weaknesses

• Synthetic data generation is a solved problem: Faker, sdv, and dozens of other libraries exist
• No evidence that its anomalies are more realistic than what existing tools already inject

Similar Projects

Developer Tools · ●●● Banger

YAML-schema-router – content-based schema routing for yaml ls

Content-aware schema routing kills YAML LSP guesswork; elegant stdio proxy architecture.

Big Brain · Niche Gem

drLuca

213mo ago

AI/ML · ● Mid

Apery – Synthetic Data Generator for AI Agents

Yet another synthetic data tool when Faker and Mockaroo already exist.

Ship It

compuficial

218d ago

Data · ●● Solid

MedSynth – Multi-lingual synthetic healthcare data with OCR artifacts

This isn't another clean, English-only faker. It intentionally models script-specific OCR errors (Hebrew/Arabic/Latin confusions), per-hospital schema variance, and country-specific ID formats, so models see the kind of mess real systems produce. Output is NDJSON and usable from the CLI, which makes it straightforward to plug into pipelines. The repo looks very new, though, and documentation and examples are thin: a promising concept, but you'll still need to tinker to use it at scale.

Niche Gem · Wizardry

Alechko

113mo ago

Developer Tools · ●● Solid