On Large-Cohort Training for Federated Learning
This work addresses scalability and efficiency problems faced by federated learning practitioners. It is incremental, however: it builds on existing methods to study the effects of cohort size.
The paper empirically investigates how increasing the number of clients sampled per round (the cohort size) affects model quality and training dynamics in federated learning. It identifies challenges such as generalization issues, diminishing returns, training failures, and fairness concerns.
Federated learning methods typically learn a model by iteratively sampling updates from a population of clients. In this work, we explore how the number of clients sampled at each round (the cohort size) impacts the quality of the learned model and the training dynamics of federated learning algorithms. Our work poses three fundamental questions. First, what challenges arise when trying to scale federated learning to larger cohorts? Second, what parallels exist between cohort sizes in federated learning and batch sizes in centralized learning? Last, how can we design federated learning methods that effectively utilize larger cohort sizes? We give partial answers to these questions based on extensive empirical evaluation. Our work highlights a number of challenges stemming from the use of larger cohorts. While some of these (such as generalization issues and diminishing returns) are analogs of large-batch training challenges, others (including training failures and fairness concerns) are unique to federated learning.
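The sampling loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's algorithm or experimental setup. It assumes a FedAvg-style round: each sampled client runs a few local gradient steps, and the server averages the resulting model deltas. It also assumes a hypothetical population of clients whose losses are 1-D quadratics 0.5 * (w - c_i)^2. The names `local_update`, `fedavg_round`, `train`, and `aggregate_spread`, as well as all hyperparameters, are illustrative assumptions.

The sketch also shows one side of the batch-size parallel. The averaged cohort update is a sample mean over clients, so its spread across random cohorts shrinks roughly as 1/sqrt(M) for cohort size M, and it reaches zero when the cohort is the whole population. Each additional client therefore reduces noise by less than the one before it.

```python
import random
import statistics


def local_update(w, c, lr=0.1, steps=5):
    """Hypothetical client: run `steps` of gradient descent on the toy loss
    0.5 * (w - c)**2, starting from the global model w, and return the delta."""
    w_local = w
    for _ in range(steps):
        w_local -= lr * (w_local - c)  # gradient of the quadratic is (w - c)
    return w_local - w


def fedavg_round(w, client_optima, cohort_size, rng, server_lr=1.0):
    """One FedAvg-style round (assumed): sample a cohort without replacement,
    collect local deltas, and apply their average to the global model."""
    cohort = rng.sample(client_optima, cohort_size)
    deltas = [local_update(w, c) for c in cohort]
    return w + server_lr * sum(deltas) / cohort_size


def train(client_optima, cohort_size, rounds=50, seed=0):
    """Run several rounds from w = 0 and return the final global model."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        w = fedavg_round(w, client_optima, cohort_size, rng)
    return w


def aggregate_spread(client_optima, cohort_size, w=0.0, trials=500, seed=0):
    """Population std. dev. of the averaged cohort update across random cohorts,
    with every cohort evaluated at the same global model w."""
    rng = random.Random(seed)
    # Each client's delta at w is fixed, so compute it once and then resample.
    deltas = [local_update(w, c) for c in client_optima]
    means = [
        statistics.fmean(rng.sample(deltas, cohort_size)) for _ in range(trials)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical heterogeneous population: each client prefers a different optimum.
    population = [rng.gauss(3.0, 2.0) for _ in range(1000)]
    target = statistics.fmean(population)  # minimizer of the average client loss
    for m in (1, 10, 100, 1000):
        spread = aggregate_spread(population, m)
        error = abs(train(population, m) - target)
        print(f"cohort={m:5d}  update spread={spread:.4f}  final error={error:.4f}")
```

Because cohorts are sampled without replacement, the spread includes a finite-population correction. In this convex toy setting, larger cohorts only reduce noise. The sketch therefore does not reproduce the generalization, training-failure, or fairness effects that the abstract describes for real federated tasks.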