GitHub Repository

Adaptive PHI de-identification for streaming multimodal data: exposure-aware, stateful, and audit-ready

1 star · Python

Modeled healthcare de-identification as a longitudinal RL control problem

by vkatganti·Mar 4, 2026·1 point·0 comments

AI Analysis

●● Solid · Big Brain · Wizardry

Stateful, exposure-aware de-ID over time—novel framing, but repo is research-only with synthetic data.

Strengths
  • Reframes de-identification as a feedback-loop control problem instead of static preprocessing—genuinely different mental model.
  • Cross-modal linkage modeling (text, ASR, image proxies, waveforms) captures re-ID risk that stateless systems miss.
  • Pseudonym versioning on risk escalation is clever—avoids global reprocessing when privacy constraints tighten.
Weaknesses
  • No real data, no production deployment, no evidence of HIPAA compliance or clinical validation—remains a research prototype.
  • Synthetic-only demo limits credibility for healthcare; real-world re-identification risk validation would be essential before clinical use.
Target Audience

Healthcare AI/ML researchers, privacy engineers, HIPAA-compliance teams

Post Description

Most PHI de-identification pipelines are stateless: detect identifiers, remove them, done. The problem is that re-identification risk doesn't work that way in practice.

A name fragment that's harmless in record #1 becomes identifying when it co-occurs with a location in record #47 and a timestamp in record #203. Static masking can't see that.

This project treats de-identification as a stateful control problem instead. The system maintains a per-subject exposure graph across time and modalities, computes rolling re-identification risk, and dynamically escalates masking strength only when cumulative exposure justifies it.
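
To make the loop concrete, here is a minimal sketch of per-subject exposure tracking with rolling risk and threshold-based escalation. It is not the repo's code: the class names, identifier weights, thresholds, window size, and the noisy-OR risk formula are all assumptions chosen for illustration. It also swaps in a flat rolling window for the exposure graph and fixed thresholds for the RL policy.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field

# Hypothetical weights: how strongly each identifier type narrows down a subject.
WEIGHTS = {"name_fragment": 0.2, "location": 0.3, "timestamp": 0.2}
# Hypothetical thresholds mapping cumulative risk to a masking level.
LEVELS = [(0.0, "light"), (0.5, "moderate"), (0.8, "strict")]


@dataclass
class SubjectState:
    # Rolling window of (modality, identifier_type) exposures for one subject.
    window: deque = field(default_factory=lambda: deque(maxlen=50))

    def observe(self, modality, identifier_types):
        for id_type in identifier_types:
            self.window.append((modality, id_type))

    def risk(self):
        # Noisy-OR over distinct identifier types seen in the window (assumption).
        miss = 1.0
        for id_type in {t for _, t in self.window}:
            miss *= 1.0 - WEIGHTS.get(id_type, 0.1)
        # Small additive bonus per extra modality, standing in for cross-modal linkage.
        modalities = {m for m, _ in self.window}
        bonus = 0.05 * max(0, len(modalities) - 1)
        return min(1.0, (1.0 - miss) + bonus)


def masking_level(risk):
    # Pick the highest level whose threshold the risk has reached.
    level = LEVELS[0][1]
    for threshold, name in LEVELS:
        if risk >= threshold:
            level = name
    return level


class ExposureTracker:
    def __init__(self):
        self.subjects = defaultdict(SubjectState)

    def process(self, subject_id, modality, identifier_types):
        # Update state first, then decide masking from cumulative exposure.
        state = self.subjects[subject_id]
        state.observe(modality, identifier_types)
        risk = state.risk()
        return risk, masking_level(risk)


tracker = ExposureTracker()
for modality, ids in [
    ("text", ["name_fragment"]),
    ("asr", ["location"]),
    ("waveform_header", ["timestamp"]),
]:
    risk, level = tracker.process("subject-1", modality, ids)
    print(f"{modality:16s} risk={risk:.2f} level={level}")
```

Under these made-up weights, no single record triggers escalation, but the third one pushes the subject past the "moderate" threshold. A stateless masker would treat all three records identically.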

The core idea: privacy protection as a feedback loop, not a preprocessing step.

A few things I found interesting building this:

- Cross-modal linkage (text + ASR + image proxy + waveform headers) creates non-obvious re-ID surfaces
- Pseudonym versioning on risk escalation lets you contain linkage continuity without global reprocessing
- The privacy–utility tradeoff is actually controllable if you model exposure state explicitly
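
As an illustration of the pseudonym versioning point, the sketch below derives a pseudonym from a keyed hash of the subject ID plus a per-subject version counter. Bumping the counter on escalation gives new records a fresh pseudonym and leaves already-released records untouched. The `PseudonymRegistry` class, the `SUBJ-` prefix, and the HMAC-SHA256 construction are assumptions for this example, not the repo's actual scheme.

```python
import hashlib
import hmac


class PseudonymRegistry:
    """Hypothetical registry: one version counter per subject."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._versions = {}  # subject_id -> current pseudonym version

    def pseudonym(self, subject_id: str) -> str:
        # Same subject and version always map to the same pseudonym.
        version = self._versions.get(subject_id, 0)
        message = f"{subject_id}|v{version}".encode()
        digest = hmac.new(self._key, message, hashlib.sha256).hexdigest()
        return f"SUBJ-{digest[:12]}"

    def rotate(self, subject_id: str) -> int:
        # Called on risk escalation: later records can no longer be joined
        # to earlier ones through the pseudonym alone.
        self._versions[subject_id] = self._versions.get(subject_id, 0) + 1
        return self._versions[subject_id]


registry = PseudonymRegistry(b"demo-key-not-for-real-use")
before = registry.pseudonym("patient-1")
registry.rotate("patient-1")
after = registry.pseudonym("patient-1")
print(before != after)
```

Rotation only touches one counter, so nothing already emitted has to be reprocessed. The cost is that downstream consumers lose longitudinal continuity for that subject at the rotation point.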

All experiments run on synthetic streaming data (no real PHI). Reproducible from source. Colab demo included.

Repo: https://github.com/azithteja91/phi-exposure-guard

Happy to discuss the architecture, the RL policy design, or the tradeoffs vs. existing de-ID approaches.
