Synthetic Healthcare Data and Privacy

The goal is useful data without unsafe disclosure: synthetic data that supports model development, collaboration, and validation while respecting privacy constraints.

01 Sensitive health data
02 Generative model
03 Privacy and fidelity checks
04 Shareable research asset
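As a rough illustration of the third stage, the sketch below runs one fidelity check and one privacy check on toy tabular records. Everything in it is an assumption made for illustration: the function names, the thresholds, the toy records, and the choice of checks (total variation distance on one column, and the rate of exact copies of real rows). It is not drawn from any paper listed here, and an exact-copy rate is only a weak proxy for privacy risk.

```python
from collections import Counter


def marginal_gap(real, synthetic, column):
    """Fidelity: total variation distance between one column's category frequencies."""
    real_freq = Counter(row[column] for row in real)
    syn_freq = Counter(row[column] for row in synthetic)
    categories = set(real_freq) | set(syn_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic))
        for c in categories
    )


def exact_copy_rate(real, synthetic):
    """Privacy proxy: fraction of synthetic rows that duplicate a real row exactly."""
    real_rows = {tuple(sorted(row.items())) for row in real}
    copies = sum(tuple(sorted(row.items())) in real_rows for row in synthetic)
    return copies / len(synthetic)


def passes_checks(real, synthetic, columns, max_gap=0.1, max_copy_rate=0.0):
    """Gate release on both checks; the thresholds are hypothetical defaults."""
    fidelity_ok = all(marginal_gap(real, synthetic, c) <= max_gap for c in columns)
    privacy_ok = exact_copy_rate(real, synthetic) <= max_copy_rate
    return fidelity_ok and privacy_ok


# Toy records, invented for this example.
real = [
    {"sex": "F", "diagnosis": "E11"},
    {"sex": "M", "diagnosis": "I10"},
    {"sex": "F", "diagnosis": "I10"},
    {"sex": "M", "diagnosis": "E11"},
]
synthetic = [
    {"sex": "F", "diagnosis": "E11"},  # exact copy of a real row
    {"sex": "M", "diagnosis": "E11"},  # exact copy of a real row
    {"sex": "F", "diagnosis": "J45"},
    {"sex": "M", "diagnosis": "J45"},
]

print(marginal_gap(real, synthetic, "sex"))
print(marginal_gap(real, synthetic, "diagnosis"))
print(exact_copy_rate(real, synthetic))
print(passes_checks(real, synthetic, ["sex", "diagnosis"]))
```

In this toy case the synthetic set matches the real sex distribution but not the diagnosis distribution, and half of its rows are verbatim copies, so it would fail both the fidelity and the privacy gate.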

Recent papers

What this program is building

Selected recent and foundational papers, each summarized by its task, why it matters, and its main technical result.

2025, Patterns

MediSim: Multi-Granular Simulation for Enriching Longitudinal, Multi-Modal Electronic Health Records

1. Longitudinal EHR
2. Multi-granular simulator
3. Synthetic patient timeline
Task
Generate longitudinal, multimodal EHR-like data at multiple clinical granularities.
Why it matters
Useful synthetic EHR data can expand research access while reducing reliance on sensitive raw patient data.
Main result
MediSim models visits, codes, and modalities together so synthetic records remain clinically useful.
2025, Patterns

SECONDGRAM: Self-Conditioned Diffusion with Gradient Manipulation for Longitudinal MRI Imputation

1. Incomplete MRI visits
2. Diffusion imputation
3. Consistent timeline
Task
Impute missing longitudinal MRI observations with diffusion modeling.
Why it matters
Longitudinal imaging studies often have missing visits; better imputation can preserve cohort value.
Main result
Self-conditioning and gradient manipulation improve longitudinal consistency in generated MRI sequences.
2024, AAAI

ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation

1. Clinical rules
2. Constrained generation
3. Plausible EHR sequence
Task
Generate EHR sequences that obey logical clinical constraints.
Why it matters
Synthetic records must be plausible, not just statistically similar, for downstream clinical modeling.
Main result
ConSequence injects logical constraints so generated event sequences better respect clinical structure.
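To make "logical clinical constraints" concrete, here is a minimal sketch that checks a generated visit sequence against two kinds of rules after the fact. It is a much simpler stand-in, not ConSequence's method, which injects constraints into generation itself. The rule kinds, the event names such as `insulin_rx`, and the function names are all hypothetical and chosen only for illustration.

```python
def violates_requires_prior(visits, trigger, prerequisite):
    """True if `trigger` occurs with no `prerequisite` in the same or an earlier visit."""
    seen = set()
    for visit in visits:
        seen |= set(visit)
        if trigger in visit and prerequisite not in seen:
            return True
    return False


def violates_mutual_exclusion(visits, code_a, code_b):
    """True if any single visit contains both codes."""
    return any(code_a in visit and code_b in visit for visit in visits)


def rule_violations(visits, rules):
    """Return the rules, as (kind, a, b) tuples, that the sequence breaks."""
    violations = []
    for kind, a, b in rules:
        if kind == "requires_prior" and violates_requires_prior(visits, a, b):
            violations.append((kind, a, b))
        elif kind == "excludes" and violates_mutual_exclusion(visits, a, b):
            violations.append((kind, a, b))
    return violations


# Hypothetical rules and event names, invented for this example.
rules = [
    ("requires_prior", "insulin_rx", "diabetes_dx"),
    ("excludes", "pregnancy_dx", "prostate_exam"),
]

# Each patient is a list of visits; each visit is a list of event codes.
plausible = [["diabetes_dx"], ["insulin_rx"]]
implausible = [["insulin_rx"], ["diabetes_dx", "pregnancy_dx", "prostate_exam"]]

print(rule_violations(plausible, rules))
print(rule_violations(implausible, rules))
```

A post-hoc checker like this can only flag or filter implausible records; building the constraints into the generator, as the paper describes, aims to avoid producing them in the first place.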
2024, arXiv

TrialSynth: Generation of Synthetic Sequential Clinical Trial Data

1. Trial trajectories
2. Sequential generator
3. Shareable trial data
Task
Create synthetic sequential data for clinical trial research.
Why it matters
Synthetic trial data can help prototype methods when real trial data are hard to share.
Main result
TrialSynth extends generative modeling from patient records toward trial-like longitudinal sequences.
2024, Value in Health

P53 Validation of AI-Generated Synthetic Data for Outcomes Research: A Clinico-Genomics Case Study

1. Clinico-genomics
2. Synthetic validation
3. Outcomes research utility
Task
Validate AI-generated synthetic data in a clinico-genomics outcomes research setting.
Why it matters
The paper addresses whether synthetic data can support real-world evidence questions, rather than only scoring well on demonstration metrics.
Main result
It frames validation around downstream outcomes research utility and clinico-genomic consistency.
2023, Nature Communications

Synthesize High-Dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

1. High-dimensional EHR
2. Hierarchical language model
3. Synthetic records
Task
Generate high-dimensional longitudinal EHR with a hierarchical autoregressive language model.
Why it matters
This work established a strong foundation for scalable, privacy-aware synthetic patient records.
Main result
The model captures temporal and hierarchical EHR structure to create clinically useful synthetic records.

Interested in this program?

Send a concise note with the program name, your role, the problem you want to work on, and any relevant data, code, clinical setting, or research experience.

Contact Sunlab