Muge Capan

h-index6
2papers

2 Papers

LGJan 2, 2025
Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning Application

Yusi Wei, Hande Y. Benson, Joseph K. Agor et al.

Organizations are collecting vast amounts of data, but they often lack the capabilities needed to fully extract insights. As a result, they increasingly share data with external experts, such as analysts or researchers, to gain value from it. However, this practice introduces significant privacy risks. Various techniques have been proposed to address privacy concerns in data sharing. However, these methods often degrade data utility, impacting the performance of machine learning (ML) models. Our research identifies key limitations in existing optimization models for privacy preservation, particularly in handling categorical variables, and evaluating effectiveness across diverse datasets. We propose a novel multi-objective optimization model that simultaneously minimizes information loss and maximizes protection against attacks. This model is empirically validated using diverse datasets and compared with two existing algorithms. We assess information loss, the number of individuals subject to linkage or homogeneity attacks, and ML performance after anonymization. The results indicate that our model achieves lower information loss and more effectively mitigates the risk of attacks, reducing the number of individuals susceptible to these attacks compared to alternative algorithms in some cases. Additionally, our model maintains comparable ML performance relative to the original data or data anonymized by other methods. Our findings highlight significant improvements in privacy protection and ML model performance, offering a comprehensive and extensible framework for balancing privacy and utility in data sharing.

APAug 25, 2025
An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing

Yusi Wei, Hande Y. Benson, Muge Capan

The secondary use of healthcare data is vital for research and clinical innovation, but it raises concerns about patient privacy. This study investigates how to balance privacy preservation and data utility in healthcare data sharing, considering the perspectives of both data providers and data users. Using a dataset of adult patients hospitalized between 2013 and 2015, we predict whether sepsis was present at admission or developed during the hospital stay. We identify sub-populations, such as older adults, frequently hospitalized patients, and racial minorities, that are especially vulnerable to privacy attacks due to their unique combinations of demographic and healthcare utilization attributes. These groups are also critical for machine learning (ML) model performance. We evaluate three anonymization methods-$k$-anonymity, the technique by Zheng et al., and the MO-OBAM model-based on their ability to reduce re-identification risk while maintaining ML utility. Results show that $k$-anonymity offers limited protection. The methods of Zheng et al. and MO-OBAM provide stronger privacy safeguards, with MO-OBAM yielding the best utility outcomes: only a 2% change in precision and recall compared to the original dataset. This work provides actionable insights for healthcare organizations on how to share data responsibly. It highlights the need for anonymization methods that protect vulnerable populations without sacrificing the performance of data-driven models.