GitHub Repository

Adaptive PHI de-identification for streaming multimodal data: exposure-aware, stateful and audit ready

1 star · Python

Modeled healthcare de-identification as longitudinal RL control problem

Author: vkatganti

by vkatganti·Mar 4, 2026·1 point·0 comments


AI Analysis

Solid · Big Brain · Wizardry

Stateful, exposure-aware de-ID over time: a novel framing, but the repo is research-only and runs on synthetic data.

Strengths

•Reframes de-identification as a feedback-loop control problem instead of static preprocessing—genuinely different mental model.
•Cross-modal linkage modeling (text, ASR, image proxies, waveforms) captures re-ID risk that stateless systems miss.
•Pseudonym versioning on risk escalation is clever—avoids global reprocessing when privacy constraints tighten.

Weaknesses

•No real data, no production deployment, no evidence of HIPAA compliance or clinical validation—remains a research prototype.
•Synthetic-only demo limits credibility for healthcare; real-world re-identification risk validation would be essential before clinical use.

Post Description

Most PHI de-identification pipelines are stateless: detect identifiers, remove them, done. The problem is that re-identification risk doesn't work that way in practice.

A name fragment that's harmless in record #1 becomes identifying when it co-occurs with a location in record #47 and a timestamp in record #203. Static masking can't see that.
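To make the accumulation effect concrete, here is a toy sketch (not taken from the repo) that counts how many people in a small synthetic population remain consistent with the attributes leaked so far. The population, the attribute names, and the helpers `anonymity_set_size` and `sizes_over_stream` are all hypothetical and chosen only for illustration.

```python
# Hypothetical synthetic population; none of these values come from the project.
POPULATION = [
    {"surname_prefix": "Mor", "city": "Austin", "visit_hour": 9},
    {"surname_prefix": "Mor", "city": "Dallas", "visit_hour": 9},
    {"surname_prefix": "Mor", "city": "Austin", "visit_hour": 14},
    {"surname_prefix": "Lee", "city": "Austin", "visit_hour": 9},
    {"surname_prefix": "Lee", "city": "Dallas", "visit_hour": 14},
]


def anonymity_set_size(known: dict) -> int:
    """Count people consistent with every attribute leaked so far."""
    return sum(
        all(person[key] == value for key, value in known.items())
        for person in POPULATION
    )


def sizes_over_stream(leaks: list) -> list:
    """Return the anonymity-set size after each successive leak."""
    known, sizes = {}, []
    for attribute, value in leaks:
        known[attribute] = value  # state persists across records
        sizes.append(anonymity_set_size(known))
    return sizes


# Each leak might come from a different record in the stream.
leaks = [("surname_prefix", "Mor"), ("city", "Austin"), ("visit_hour", 9)]
sizes = sizes_over_stream(leaks)
print(sizes)  # [3, 2, 1]
```

No single leak identifies anyone here, but the third one shrinks the candidate set to one person. A stateless masker that scores each record in isolation never sees this shrinking set.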

This project treats de-identification as a stateful control problem instead. The system maintains a per-subject exposure graph across time and modalities, computes rolling re-identification risk, and dynamically escalates masking strength only when cumulative exposure justifies it.
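A minimal sketch of that loop, under loud assumptions: the class `ExposureTracker`, the integer risk points in `WEIGHTS`, the thresholds in `LEVELS`, and the fixed threshold rule are all hypothetical stand-ins. The project describes an exposure graph and an RL policy, and this sketch implements neither. It only shows the shape of the idea: per-subject state, a rolling window, and masking strength driven by cumulative exposure.

```python
from collections import defaultdict, deque

# Hypothetical risk points per identifier type (integers keep comparisons exact).
WEIGHTS = {"name_fragment": 2, "location": 3, "timestamp": 2, "face_proxy": 5}
# Hypothetical escalation thresholds, in ascending order.
LEVELS = [(0, "none"), (4, "generalize"), (8, "redact")]


class ExposureTracker:
    """Tracks recent identifier exposures per subject in a rolling window."""

    def __init__(self, window: int = 50):
        # Oldest events fall out automatically once the window is full.
        self.events = defaultdict(lambda: deque(maxlen=window))

    def observe(self, subject: str, modality: str, identifier_type: str) -> str:
        """Record one exposure and return the masking level now required."""
        self.events[subject].append((modality, identifier_type))
        return self.masking_level(subject)

    def risk(self, subject: str) -> int:
        events = self.events[subject]
        kinds = {kind for _, kind in events}
        modalities = {modality for modality, _ in events}
        base = sum(WEIGHTS.get(kind, 1) for kind in kinds)
        # Assumed bonus: each extra modality adds one point for cross-modal linkage.
        bonus = max(len(modalities) - 1, 0)
        return base + bonus

    def masking_level(self, subject: str) -> str:
        score = self.risk(subject)
        level = "none"
        for threshold, name in LEVELS:
            if score >= threshold:
                level = name
        return level


tracker = ExposureTracker()
print(tracker.observe("s1", "text", "name_fragment"))  # none
print(tracker.observe("s1", "text", "location"))       # generalize
print(tracker.observe("s1", "asr", "timestamp"))       # redact
```

The same `timestamp` event that triggers `redact` for subject `s1` would yield `none` for a subject with no prior exposure. That asymmetry is what a stateless pipeline cannot express.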

The core idea: privacy protection as a feedback loop, not a preprocessing step.

A few things I found interesting building this:
- Cross-modal linkage (text + ASR + image proxy + waveform headers) creates non-obvious re-ID surfaces
- Pseudonym versioning on risk escalation lets you contain linkage continuity without global reprocessing
- The privacy–utility tradeoff is actually controllable if you model exposure state explicitly
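One way to picture pseudonym versioning is the sketch below. It is an illustration under assumptions, not the repo's implementation: `PseudonymRegistry`, its methods, and the keyed-hash scheme (HMAC-SHA256 over subject ID plus a version counter) are hypothetical choices. Bumping a subject's version changes every pseudonym issued afterwards, so new records no longer link to old ones, while records already emitted are left untouched.

```python
import hashlib
import hmac


class PseudonymRegistry:
    """Issues stable pseudonyms that rotate when a subject's risk escalates."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._version = {}  # subject_id -> current pseudonym version

    def pseudonym(self, subject_id: str) -> str:
        """Return the pseudonym for the subject's current version."""
        version = self._version.get(subject_id, 0)
        message = f"{subject_id}|v{version}".encode()
        digest = hmac.new(self._secret, message, hashlib.sha256).hexdigest()
        return f"P-{digest[:12]}"

    def escalate(self, subject_id: str) -> str:
        """Bump the version so future records use an unlinkable pseudonym."""
        self._version[subject_id] = self._version.get(subject_id, 0) + 1
        return self.pseudonym(subject_id)


registry = PseudonymRegistry(secret=b"demo-key-not-for-production")
before = registry.pseudonym("subject-17")
after = registry.escalate("subject-17")  # called when risk crosses a threshold
print(before != after)  # True
```

Only the escalated subject's pseudonym changes. Other subjects keep theirs, which is what makes this cheaper than reprocessing the whole stream.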

All experiments run on synthetic streaming data (no real PHI). Reproducible from source. Colab demo included.

Repo: https://github.com/azithteja91/phi-exposure-guard

Happy to discuss the architecture, the RL policy design, or the tradeoffs vs. existing de-ID approaches.