Pisces: Efficient Federated Learning via Guided Asynchronous Training
Pisces addresses the trade-off between client speed and data quality in federated learning for distributed machine learning systems, an incremental improvement over existing methods.
The paper tackles the problem of slow clients delaying federated learning. It presents Pisces, an asynchronous FL system with intelligent participant selection and model aggregation, which accelerates time-to-accuracy by up to 2.0x over state-of-the-art synchronous schemes and by up to 1.9x over asynchronous ones.
Federated learning (FL) is typically performed in a synchronous parallel manner, where the involvement of a slow client delays a training iteration. Current FL systems employ a participant selection strategy to select fast clients with quality data in each iteration. However, this is not always possible in practice, and the selection strategy often has to navigate an unpleasant trade-off between the speed and the data quality of clients. In this paper, we present Pisces, an asynchronous FL system with intelligent participant selection and model aggregation for accelerated training. To avoid incurring excessive resource cost and stale training computation, Pisces uses a novel scoring mechanism to identify suitable clients to participate in a training iteration. It also adapts the pace of model aggregation to dynamically bound the progress gap between the selected clients and the server, with a provable convergence guarantee in a smooth non-convex setting. We have implemented Pisces in an open-source FL platform called Plato, and evaluated its performance in large-scale experiments with popular vision and language models. Pisces outperforms the state-of-the-art synchronous and asynchronous schemes, accelerating the time-to-accuracy by up to 2.0x and 1.9x, respectively.
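The two mechanisms named in the abstract, score-based participant selection and paced aggregation that bounds the client-server progress gap, can be illustrated with a minimal sketch. It is not the Pisces implementation: the abstract does not give the scoring formula or the pacing rule, so the `Client` fields, the staleness discount in `score`, the `penalty` exponent, and the `max_gap` bound are all hypothetical choices made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical server-side view of one client (not from the paper)."""
    name: str
    data_quality: float        # assumed proxy, e.g. a recent training loss
    expected_staleness: float  # assumed estimate of server versions elapsed
                               # before this client would report back


def score(client: Client, penalty: float = 0.5) -> float:
    """Illustrative utility: data quality discounted by expected staleness.

    The functional form and the `penalty` exponent are assumptions.
    """
    return client.data_quality / (1.0 + client.expected_staleness) ** penalty


def select(clients: list[Client], k: int) -> list[Client]:
    """Pick the k highest-scoring clients among those currently idle."""
    return sorted(clients, key=score, reverse=True)[:k]


def may_aggregate(server_version: int,
                  running_base_versions: list[int],
                  max_gap: int) -> bool:
    """Allow a new aggregation only if it keeps every running client
    within `max_gap` model versions of the server.

    Each entry of `running_base_versions` is the server version a
    still-running client started from. Aggregating would move the server
    to `server_version + 1`, so the gap is checked against that value.
    """
    return all(server_version + 1 - base <= max_gap
               for base in running_base_versions)


# Usage: three hypothetical clients with different speed/quality profiles.
clients = [
    Client("fast_poor", data_quality=1.0, expected_staleness=0.0),
    Client("slow_rich", data_quality=4.0, expected_staleness=8.0),
    Client("mid", data_quality=2.0, expected_staleness=1.0),
]
chosen = select(clients, k=2)
```

In this sketch a slow client with valuable data can still be chosen, but its value is discounted by how stale its update is expected to be. The aggregation check then holds the server back whenever advancing would leave a running client more than `max_gap` versions behind.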