Synthetic Healthcare Data & Privacy
Developing privacy-preserving methods for generating synthetic healthcare data that retains clinical utility while protecting patient confidentiality.

This research area addresses the need for high-quality healthcare data that can be used without compromising patient privacy or violating regulatory requirements. The work focuses on generative models that create synthetic electronic health records (EHRs), clinical trial data, and multi-modal healthcare datasets. These datasets are designed to preserve the statistical properties and clinical patterns of real data while protecting patient anonymity.
Key innovations include hierarchical autoregressive language models for longitudinal EHR synthesis, multi-granular simulation frameworks for complex healthcare data, and generative adversarial networks for patient record generation. The research tackles fundamental challenges in healthcare data sharing, including temporal dependencies in medical records, rare disease representation, multi-modal data integration, and the trade-off between data utility and privacy protection.
The resulting synthetic data enables broader research collaboration, algorithm development, and clinical studies while supporting HIPAA compliance and ethical standards.
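To make the trade-off between data utility and privacy protection more concrete, here is a minimal Python sketch of two simple checks one might run on a synthetic dataset. It is an illustration only, not the evaluation protocol of the publications listed on this page. The function names (`code_prevalence`, `mean_prevalence_gap`, `exact_copy_rate`) and the toy records are hypothetical, and each patient record is assumed to be a set of diagnosis codes.

```python
def code_prevalence(records, vocabulary):
    """Fraction of records that contain each code, in vocabulary order."""
    n = len(records)
    return [sum(code in record for record in records) / n for code in vocabulary]


def mean_prevalence_gap(real, synthetic, vocabulary):
    """Utility proxy: mean absolute difference in per-code prevalence.

    Lower values mean the synthetic data reproduces marginal code
    frequencies more closely.
    """
    real_prev = code_prevalence(real, vocabulary)
    synth_prev = code_prevalence(synthetic, vocabulary)
    gaps = [abs(r - s) for r, s in zip(real_prev, synth_prev)]
    return sum(gaps) / len(gaps)


def exact_copy_rate(real, synthetic):
    """Privacy proxy: fraction of synthetic records identical to a real record.

    A high value suggests the generator may be memorizing real patients.
    """
    real_set = {frozenset(record) for record in real}
    copies = sum(frozenset(record) in real_set for record in synthetic)
    return copies / len(synthetic)


# Toy data made up for illustration; not drawn from any real dataset.
vocabulary = ["E11", "I10", "N18"]
real = [{"E11", "I10"}, {"I10"}, {"E11", "N18"}, {"I10", "N18"}]
synthetic = [{"E11", "I10"}, {"I10", "N18"}, {"E11"}, {"I10"}]

utility_gap = mean_prevalence_gap(real, synthetic, vocabulary)
copy_rate = exact_copy_rate(real, synthetic)
print(f"mean prevalence gap: {utility_gap:.3f}")
print(f"exact copy rate: {copy_rate:.2f}")
```

In this toy example the marginal frequencies match closely, yet most synthetic records duplicate a real one, which shows why utility and privacy need to be assessed separately. Real evaluations typically use stronger measures, such as downstream model performance for utility and membership or attribute inference attacks for privacy.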
Publications
Synthesize High-Dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model (Nature Communications, 2023)
MediSim: Multi-granular Simulation for Enriching Longitudinal, Multi-modal Electronic Health Records (Patterns, 2025)
Generating Multi-label Discrete Patient Records using Generative Adversarial Networks (ML4HC 2017)
Interested in This Research Area?
We welcome collaborations with researchers, clinicians, and industry partners working on synthetic healthcare data and privacy. Our lab is always looking for motivated students and postdocs to join our team.