LGCRCY Feb 9, 2025

Privacy-Preserving Dataset Combination

arXiv:2502.05765v3
1 citation, h-index: 1
Originality: Highly original
AI Analysis

This work addresses the problem of limited data sharing due to privacy concerns, particularly for smaller organizations in regulated domains like healthcare.

The authors introduce SecureKL, a secure protocol that evaluates a source dataset against candidate datasets before any data is shared. On real-world data, its results show high consistency (>90% correlation) with non-private counterparts, and it identifies beneficial data collaborations in two heterogeneous settings: ICU mortality prediction across hospitals and income prediction across states.

Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack resources to purchase data or negotiate favorable sharing agreements, because they cannot privately assess the utility of external data. To resolve privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluations with zero privacy leakage, designed to be applied preceding data sharing. SecureKL evaluates a source dataset against candidates, performing dataset divergence metrics internally with private computations, all without assuming downstream models. On real-world data, SecureKL achieves high consistency (>90% correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.
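
To make the idea of a dataset-to-dataset divergence concrete, here is a minimal sketch of the plaintext (non-private) side of such a comparison. It is not the SecureKL protocol. The abstract does not say which divergence is used or how it is computed, so this assumes a KL divergence between smoothed empirical histograms of a single categorical feature. The function names (`empirical_distribution`, `kl_divergence`, `rank_candidates`) and the smoothing constant are hypothetical, and no secure computation is performed.

```python
import math
from collections import Counter


def empirical_distribution(samples, support, smoothing=1.0):
    """Laplace-smoothed empirical distribution over a fixed support.

    Smoothing keeps every probability positive so the KL ratio is defined.
    """
    counts = Counter(samples)
    total = len(samples) + smoothing * len(support)
    return {v: (counts[v] + smoothing) / total for v in support}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same support."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)


def rank_candidates(source, candidates):
    """Score each candidate dataset against the source, lowest divergence first.

    `candidates` maps a (hypothetical) dataset name to its list of samples.
    """
    support = sorted(set(source).union(*candidates.values()))
    p = empirical_distribution(source, support)
    scores = {
        name: kl_divergence(p, empirical_distribution(data, support))
        for name, data in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    source = ["a", "a", "b", "c"]
    candidates = {
        "near": ["a", "a", "b", "c"],
        "far": ["c", "c", "c", "c"],
    }
    for name, score in rank_candidates(source, candidates):
        print(f"{name}: KL = {score:.4f}")
```

In this sketch both datasets are visible to whoever runs the code, which is exactly the leakage a secure protocol is meant to avoid. How a divergence score maps to a "beneficial" collaboration is not stated in the abstract, so the ascending sort here is only an illustrative choice.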

