CR CY | Jun 30, 2025

Aim High, Stay Private: Differentially Private Synthetic Data Enables Public Release of Behavioral Health Information with High Utility

Mohsen Ghasemizade, Juniper Lovato, Christopher M. Danforth, Peter Sheridan Dodds, Laura S. P. Bloomfield, Matthew Price, Team LEMURS, Joseph P. Near

arXiv:2507.02971 | h-index: 5

Originality: Synthesis-oriented

AI Analysis

For researchers and institutions sharing sensitive behavioral health data, this work provides a practical framework to release DP synthetic data with quantified privacy-utility trade-offs, though the approach is incremental.

The authors applied differential privacy (DP) via the Adaptive Iterative Mechanism (AIM) to generate synthetic data from Phase 1 of a real behavioral health study (LEMURS), evaluating privacy budgets from epsilon = 1 to epsilon = 100. Synthetic datasets generated with epsilon = 5 preserved adequate predictive utility for downstream analyses while significantly mitigating privacy risks.

Sharing health and behavioral data raises significant privacy concerns, as conventional de-identification methods are susceptible to privacy attacks. Differential Privacy (DP) provides formal guarantees against re-identification risks, but practical implementation requires balancing privacy protection against data utility. We demonstrate the use of DP to protect individuals in a real behavioral health study, while making the data publicly available and retaining high utility for downstream users of the data. We use the Adaptive Iterative Mechanism (AIM) to generate DP synthetic data for Phase 1 of the Lived Experiences Measured Using Rings Study (LEMURS). The LEMURS dataset comprises physiological measurements from wearable devices (Oura rings) and self-reported survey data from first-year college students. We evaluate the synthetic datasets across a range of privacy budgets, from epsilon = 1 to epsilon = 100, focusing on the trade-off between privacy and utility. We evaluate the utility of the synthetic data using a framework informed by actual uses of the LEMURS dataset. Our evaluation identifies the trade-off between privacy and utility across synthetic datasets generated with different privacy budgets. We find that synthetic datasets with epsilon = 5 preserve adequate predictive utility while significantly mitigating privacy risks. Our methodology establishes a reproducible framework for evaluating the practical impacts of epsilon on generating private synthetic datasets with numerous attributes and records, contributing to informed decision-making in data sharing practices.
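To make the role of the privacy budget concrete, the following is a minimal sketch of how epsilon controls noise in a much simpler DP release: a histogram protected by the Laplace mechanism. It is not AIM and does not reproduce the paper's pipeline or results. The function names, the bin edges, and the simulated "sleep hours" values are all hypothetical and chosen for illustration only; no LEMURS data is used. The sketch assumes add/remove-one-record neighboring datasets, under which a histogram has L1 sensitivity 1.

```python
import numpy as np


def laplace_histogram(values, bins, epsilon, rng):
    """Release a histogram under epsilon-DP using the Laplace mechanism.

    Assumption: neighboring datasets differ by adding or removing one
    record, so exactly one bin count changes by at most 1 (L1 sensitivity 1).
    The bin edges must be fixed in advance, independent of the data.
    """
    counts, _ = np.histogram(values, bins=bins)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise


def mean_abs_error(epsilon, trials=200, seed=0):
    """Average per-bin absolute error of the noisy histogram over many trials."""
    rng = np.random.default_rng(seed)
    # Hypothetical stand-in data (simulated nightly sleep hours), not LEMURS.
    values = rng.normal(7.0, 1.0, size=500)
    bins = np.linspace(3.0, 11.0, 9)  # 8 fixed, data-independent bins
    true_counts, _ = np.histogram(values, bins=bins)
    errors = [
        np.abs(laplace_histogram(values, bins, epsilon, rng) - true_counts).mean()
        for _ in range(trials)
    ]
    return float(np.mean(errors))


# Sweep a few budgets from the range discussed in the abstract.
for eps in (1, 5, 10, 100):
    print(f"epsilon={eps:>3}  mean abs error per bin={mean_abs_error(eps):.3f}")
```

In this toy setting the expected per-bin error is 1/epsilon, so larger budgets give more accurate but less private releases. A synthetic-data mechanism such as AIM spends its budget across many such noisy measurements, which is why the utility of the resulting dataset has to be evaluated empirically at each epsilon, as the study does.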
