LG | Apr 29, 2024

A Universal Metric of Dataset Similarity for Cross-silo Federated Learning

arXiv:2404.18773v2 | 4 citations | h-index: 4 | ICDM
Originality: Incremental advance
AI Analysis

This work addresses the challenge of facilitating successful collaborations in privacy-sensitive domains such as healthcare by letting sites assess data compatibility without sharing data. The contribution is incremental, as it builds on existing FL methods.

The paper tackles the degradation of model performance caused by non-identical data distributions in cross-silo federated learning. It proposes a dataset-agnostic metric for assessing dataset similarity that is privacy-preserving and computationally efficient, and that shows a robust relationship with model performance across synthetic, benchmark, and medical imaging datasets.

Federated Learning (FL) is increasingly used in domains such as healthcare to facilitate collaborative model training without data-sharing. However, datasets located at different sites are often non-identically distributed, which degrades model performance in FL. Most existing methods for assessing these distribution shifts are limited by being dataset- or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets, including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. As the first federated dataset similarity metric, we believe this metric can better facilitate successful collaborations between sites.
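
To make the idea of a training-free similarity score between sites concrete, here is a minimal sketch. It is not the metric proposed in the paper, which the abstract does not specify. As a stand-in, it approximates a one-dimensional Wasserstein-1 distance from quantile summaries, so each site shares a small fixed-size sketch instead of raw records. The function names `site_summary` and `sketch_distance`, the synthetic Gaussian data, and the choice of 100 quantile levels are all assumptions made for illustration, and sharing quantiles carries no formal privacy guarantee.

```python
import numpy as np

def site_summary(samples, n_quantiles=100):
    """Run locally at a site: reduce one feature to a fixed-size quantile sketch."""
    # Midpoint levels (0.005, 0.015, ...) avoid the noisy min and max.
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.quantile(np.asarray(samples, dtype=float), levels)

def sketch_distance(summary_a, summary_b):
    """Approximate 1-D Wasserstein-1 distance from two quantile sketches."""
    return float(np.mean(np.abs(summary_a - summary_b)))

# Hypothetical sites: A and B draw from the same distribution, C is shifted.
rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=5000)
site_b = rng.normal(0.0, 1.0, size=5000)
site_c = rng.normal(2.0, 1.0, size=5000)

d_ab = sketch_distance(site_summary(site_a), site_summary(site_b))
d_ac = sketch_distance(site_summary(site_a), site_summary(site_c))
print(f"A vs B (same distribution): {d_ab:.3f}")
print(f"A vs C (mean shifted by 2): {d_ac:.3f}")
```

In this toy setting the score between the identically distributed sites stays near zero, while the score against the shifted site is close to the size of the shift. No model is trained at any point, which is the property the abstract highlights.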

Code Implementations: 1 repo