A Universal Metric of Dataset Similarity for Cross-silo Federated Learning
This work addresses the challenge of facilitating successful collaborations in privacy-sensitive domains such as healthcare by enabling better assessment of data compatibility without data sharing, though the contribution is incremental in that it builds on existing FL methods.
The paper tackles the problem of non-identically distributed data degrading model performance in cross-silo federated learning. It proposes a novel, dataset-agnostic metric for assessing dataset similarity that is privacy-preserving and computationally efficient, and that shows a robust relationship with model performance across synthetic, benchmark, and medical imaging datasets.
Federated Learning (FL) is increasingly used in domains such as healthcare to facilitate collaborative model training without data sharing. However, datasets located at different sites are often non-identically distributed, leading to degradation of model performance in FL. Most existing methods for assessing these distribution shifts are limited by being dataset- or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets, including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. We believe that this metric, as the first federated dataset similarity metric, can better facilitate successful collaborations between sites.