FedPS: Federated data Preprocessing via aggregated Statistics
FedPS addresses a critical bottleneck for practical federated learning deployments by enabling privacy-preserving, efficient data preprocessing, although its contribution is incremental: it extends existing preprocessing methods to federated settings.
The paper tackles the overlooked problem of data preprocessing in federated learning. It introduces FedPS, a framework that uses aggregated statistics to preprocess distributed datasets consistently and with low communication cost, and that supports a range of preprocessing tasks.
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
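To make the idea of preprocessing from aggregated statistics concrete, the following is a minimal sketch of federated standard scaling in a horizontal FL setting. It is an illustration under stated assumptions, not the FedPS implementation: the names `LocalStats`, `summarize`, `aggregate`, and `standard_scale` are hypothetical, the statistics are exchanged in the clear with no secure aggregation, and exact sums stand in for the data-sketching techniques the abstract mentions. Each client is assumed to hold at least one value for the feature.

```python
from dataclasses import dataclass
import math


@dataclass
class LocalStats:
    """Per-client summary of one feature; the only data sent to the server."""
    count: int
    total: float
    total_sq: float
    minimum: float
    maximum: float


def summarize(values):
    """Run on each client: reduce raw values to a few aggregates."""
    return LocalStats(
        count=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
        minimum=min(values),
        maximum=max(values),
    )


def aggregate(stats):
    """Run on the server: merge client summaries into global statistics."""
    n = sum(s.count for s in stats)
    mean = sum(s.total for s in stats) / n
    # Population variance from pooled first and second moments.
    # Simple but numerically fragile for large means; clamp at zero.
    var = max(sum(s.total_sq for s in stats) / n - mean * mean, 0.0)
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "min": min(s.minimum for s in stats),
        "max": max(s.maximum for s in stats),
    }


def standard_scale(values, g):
    """Run on each client using the broadcast global parameters."""
    std = g["std"] or 1.0  # guard against constant features
    return [(v - g["mean"]) / std for v in values]


# Three hypothetical clients holding disjoint rows of the same feature.
clients = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
global_stats = aggregate([summarize(c) for c in clients])
scaled = [standard_scale(c, global_stats) for c in clients]
```

Because every client applies the same global mean and standard deviation, the scaled values match what centralized scaling over the pooled data would give, while each client sends only five numbers per feature. The same global `min` and `max` would support min-max scaling in the same way.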