LG · Jan 28, 2023

Heterogeneous Datasets for Federated Survival Analysis Simulation

arXiv:2301.12166v2 · 12 citations · h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of benchmarking federated survival models for researchers in healthcare and related fields, but it is incremental as it focuses on dataset simulation rather than new modeling methods.

The authors address the lack of common benchmarking datasets for federated survival analysis with a technique for constructing realistic heterogeneous datasets from existing non-federated ones. Two algorithms based on the Dirichlet distribution, quantity-skewed splitting and label-skewed splitting, produce client partitions whose heterogeneity level is adjustable. The resulting splits are evaluated quantitatively with log-rank tests and through a qualitative analysis.
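
To make the two splitting ideas concrete, here is a minimal sketch of Dirichlet-based quantity-skewed and label-skewed partitioning. It is an illustration under stated assumptions, not the authors' published implementation: the function names, the `alpha` and `n_bins` parameters, and the choice to discretize event times into equal-sized bins for the label-skewed case are all hypothetical. The only property relied on is the standard one that a smaller Dirichlet concentration `alpha` yields more uneven proportions, and therefore more heterogeneous clients.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # Independent Gamma(alpha, 1) draws, once normalised, are Dirichlet distributed.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # A very small alpha can underflow every draw; fall back to a one-hot vector.
        draws[rng.randrange(k)] = 1.0
        total = 1.0
    return [d / total for d in draws]


def quantity_skewed_split(n_samples, n_clients, alpha, seed=0):
    """Assumed sketch: clients receive different *amounts* of data.

    One Dirichlet draw fixes the share of each client, then every sample
    index is assigned to a client according to those shares.
    """
    rng = random.Random(seed)
    shares = dirichlet(alpha, n_clients, rng)
    split = [[] for _ in range(n_clients)]
    for idx in range(n_samples):
        client = rng.choices(range(n_clients), weights=shares)[0]
        split[client].append(idx)
    return split


def label_skewed_split(times, n_clients, alpha, n_bins=4, seed=0):
    """Assumed sketch: clients see different *time-to-event* distributions.

    Samples are sorted by time and cut into n_bins equal-sized bins; each
    bin gets its own Dirichlet draw over clients, so a client may hold
    mostly short or mostly long durations.
    """
    rng = random.Random(seed)
    order = sorted(range(len(times)), key=lambda i: times[i])
    split = [[] for _ in range(n_clients)]
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        shares = dirichlet(alpha, n_clients, rng)
        for idx in order[lo:hi]:
            client = rng.choices(range(n_clients), weights=shares)[0]
            split[client].append(idx)
    return split
```

Both functions return one list of sample indices per client, so the same centralized dataset can be re-split at several values of `alpha` with a fixed seed to compare heterogeneity levels reproducibly.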

Survival analysis studies techniques for modeling the time until an event of interest occurs in a population. It has found widespread applications in healthcare, engineering, and social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can be exploited to substantially improve the quality of models trained on distributed data while preserving user privacy. Federated survival analysis is still in its early development, however, and there is no common benchmarking dataset for testing federated survival models. This work provides a novel technique for constructing realistic heterogeneous datasets in a reproducible way, starting from existing non-federated datasets. Specifically, we propose two dataset-splitting algorithms based on the Dirichlet distribution that assign each data sample to a client: quantity-skewed splitting and label-skewed splitting. These algorithms yield different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests and a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available to support reproducibility and to encourage common practices for simulating federated environments for survival analysis.
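
As a rough illustration of how a log-rank test can quantify heterogeneity between two clients, the sketch below implements the standard two-sample log-rank statistic using only the Python standard library. The function name `logrank_test` and its argument layout are assumptions made for this example, and the abstract does not say how the authors pair or aggregate the tests across clients. The p-value uses the identity that a chi-square variable with one degree of freedom has survival function erfc(sqrt(x / 2)).

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value).

    times_*  : observed durations for each group
    events_* : 1 if the event was observed, 0 if the sample is censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    # Only times at which at least one event occurred contribute.
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t, per group.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        # Expected events in group A if both groups shared one hazard.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0  # no information to separate the groups
    stat = observed_minus_expected ** 2 / variance
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

A small p-value suggests that the two groups follow different survival distributions. Applied to the data held by two simulated clients, it gives one possible numerical indicator of how heterogeneous a split is.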

Code Implementations: 2 repos