GitHub Repository

Synthetic medical record generator with realistic schema variance across locales

1 star·Python

MedSynth – Multi-lingual synthetic healthcare data with OCR artifacts

by Alechko·Feb 18, 2026·1 point·1 comment

AI Analysis

●● Solid·Niche Gem·Wizardry
The Take

This isn't another clean, English-only faker. It intentionally models script-specific OCR errors (Hebrew/Arabic/Latin confusions), per-hospital schema variance, and country-specific ID formats, so models see the kind of mess real systems produce. Output is NDJSON and the tool is usable from the CLI, which makes it straightforward to plug into pipelines. The repo looks very new, though, and documentation and examples are thin: a promising concept, but expect to tinker before using it at scale.
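Since the output is NDJSON with deliberately varying schemas, a downstream consumer would probably want to see which record shapes it is getting before training on them. Below is a minimal sketch of that check, using only the Python standard library. The field names in the sample are invented for illustration and are not taken from MedSynth's actual output; the helper `schema_counts` is likewise hypothetical.

```python
import io
import json
from collections import Counter


def schema_counts(lines):
    """Count the distinct key sets seen across NDJSON records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # A record's "schema" here is just its sorted top-level keys.
        counts[tuple(sorted(record))] += 1
    return counts


# Hypothetical sample records: these field names are assumptions,
# not MedSynth's real schema.
sample = io.StringIO(
    '{"patient_id": "A1", "name": "x", "dob": "1980-01-01"}\n'
    '{"id": "B2", "full_name": "y"}\n'
    '{"patient_id": "C3", "name": "z", "dob": "1975-05-05"}\n'
)

counts = schema_counts(sample)
for keys, n in counts.most_common():
    print(n, keys)
```

Because `schema_counts` accepts any iterable of lines, the same function could be pointed at `sys.stdin` or an open file when the generator's output is piped or saved to disk.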

Category
Target Audience

ML researchers, data scientists, and engineers working on healthcare NLP/OCR models or anyone who needs messy, multilingual synthetic medical records for training and evaluation

Similar Projects