GitHub Repository

Synthetic dirty data generator

34 stars · Python

MessyData – Synthetic dirty data generator

by santiviquez · Mar 9, 2026 · 1 point · 0 comments

AI Analysis

Mid · Solve My Problem · Cozy

Claude Code skill integration is nice, but Faker already generates dirty data.

Strengths
  • YAML schema with lognormal distributions produces realistic data patterns
  • Claude Code skill lets agents write configs and validate without manual setup
  • Date-range generation with rows-per-day supports temporal testing scenarios
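
As a rough illustration of the lognormal and rows-per-day ideas above, here is a minimal standard-library sketch. It is not MessyData's API or YAML schema: `generate_rows`, `null_rate`, and the distribution parameters are hypothetical names and values chosen only for this example.

```python
import random
from datetime import date, timedelta


def generate_rows(start, end, rows_per_day, null_rate=0.1, seed=0):
    """Yield one dict per row for every day in [start, end] (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    day = start
    while day <= end:
        for _ in range(rows_per_day):
            # Lognormal values are right-skewed, like many real monetary fields.
            amount = round(rng.lognormvariate(3.0, 0.8), 2)
            # Inject a missing value with probability null_rate (the "dirty" part).
            if rng.random() < null_rate:
                amount = None
            yield {"date": day.isoformat(), "amount": amount}
        day += timedelta(days=1)


rows = list(generate_rows(date(2026, 3, 1), date(2026, 3, 7), rows_per_day=5))
```

A real tool would presumably read these settings from a config file and support more anomaly types; the sketch only shows how a fixed row count per day and a skewed distribution combine.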
Weaknesses
  • Synthetic data generation is a solved problem: Faker, SDV, and dozens of other libraries already exist
  • No evidence of anomalies more realistic than what existing tools already inject
Target Audience

Data engineers testing pipelines and ML workflows

Similar To

Faker · SDV · Mockaroo

Similar Projects

Data · ●● Solid

MedSynth – Multi-lingual synthetic healthcare data with OCR artifacts

This isn't another clean, English-only faker — it intentionally models script-specific OCR errors (Hebrew/Arabic/Latin confusions), per-hospital schema variance, and country-specific ID formats so models see the sort of mess real systems do. Output is NDJSON and usable from the CLI, which makes it straightforward to plug into pipelines, but the repo looks very new and documentation/examples are thin — promising concept, you’ll still need to tinker to use it at scale.

Niche Gem · Wizardry
Alechko
11 · 3mo ago
Developer Tools · ●● Solid

Alyt – type-safe multi-provider analytics SDK

YAML→TypeScript codegen for analytics prevents typos and centralizes event definitions.

Solve My Problem · Big Brain
jrandolf
10 · 3mo ago