ML LG ME Jan 22, 2020

Stratified cross-validation for unbiased and privacy-preserving federated learning

arXiv:2001.08090v2 0.0010 citations
AI Analysis

This paper addresses bias issues for users of federated learning in privacy-sensitive domains, but it is incremental, as it builds on existing stratification techniques.

The paper tackles the problem of duplicated records causing over-optimistic performance estimates in federated learning, and introduces stratified cross-validation to prevent data leakage without needing deduplication algorithms.

Large-scale collections of electronic records constitute both an opportunity for the development of more accurate prediction models and a threat to privacy. To limit privacy exposure, new privacy-enhancing techniques are emerging, such as federated learning, which enables large-scale data analysis while avoiding the centralization of records in a single database that would represent a critical point of failure. Although promising for privacy protection, federated learning prevents the use of some data-cleaning algorithms, thus inducing new biases. In this work, we focus on the recurrent problem of duplicated records that, if not handled properly, may cause over-optimistic estimates of a model's performance. We introduce and discuss stratified cross-validation, a validation methodology that leverages stratification techniques to prevent data leakage in federated learning settings without relying on demanding deduplication algorithms.
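
The abstract does not spell out the procedure, so the following is only a minimal sketch of one way the idea could work, not the authors' implementation. It assumes that duplicated records agree on a few stratification attributes (the hypothetical fields `birth_year` and `sex`), and that every site maps a stratum to a fold with the same salted hash. Under those assumptions, copies of a record held at different sites land in the same fold without any record-level matching between sites. The function names, field names, and salt are illustrative.

```python
import hashlib

# Hypothetical stratification attributes; assumed identical across duplicates.
KEY_FIELDS = ["birth_year", "sex"]


def stratum_key(record, key_fields):
    """Build a stratum label from the chosen attributes of one record."""
    return "|".join(str(record[f]) for f in key_fields)


def fold_of(key, n_folds, salt="shared-salt"):
    """Map a stratum label to a fold index with a salted hash.

    Every site uses the same salt and n_folds, so the mapping is identical
    everywhere and needs no communication about individual records.
    """
    digest = hashlib.sha256(f"{salt}|{key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def assign_folds(records, key_fields, n_folds, salt="shared-salt"):
    """Return {record id: fold index} for the records held by one site."""
    return {
        rec["id"]: fold_of(stratum_key(rec, key_fields), n_folds, salt)
        for rec in records
    }


# Two sites; records "a1" and "b7" are assumed to describe the same person.
site_a = [
    {"id": "a1", "birth_year": 1950, "sex": "F"},
    {"id": "a2", "birth_year": 1982, "sex": "M"},
]
site_b = [
    {"id": "b7", "birth_year": 1950, "sex": "F"},
    {"id": "b8", "birth_year": 1975, "sex": "F"},
]

# Each site computes its assignment locally.
folds_a = assign_folds(site_a, KEY_FIELDS, n_folds=3)
folds_b = assign_folds(site_b, KEY_FIELDS, n_folds=3)
```

Because the fold depends only on the stratum, a duplicate can never sit in the training set at one site and in the validation set at another. The trade-off in this sketch is that whole strata move together, so fold sizes may be uneven when a few strata are large.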
