Sky Computing: Accelerating Geo-distributed Computing in Federated Learning
This addresses the problem of inefficient training in federated learning for users with heterogeneous devices, though it is incremental as it builds on existing model parallelism methods.
The paper tackles the computation bottleneck in federated learning due to varying device capabilities by proposing Sky Computing, a load-balanced model parallelism framework, which reduces training time by 55% compared to baseline when training a 160-layer BERT model with 64 nodes.
Federated learning is proposed by Google to safeguard data privacy through training models locally on users' devices. However, with deep learning models growing in size to achieve better results, it becomes increasingly difficult to accommodate the whole model on one single device. Thus, model parallelism is then used to divide the model weights among several devices. With this logic, the approach currently used evenly allocates weights among devices. However, in reality, a computation bottleneck may occur resulting from variant computing power of different users' devices. To address this problem, load balancing is needed to allocate the model weights based on the computational capability of the device. In this paper, we proposed Sky Computing, a load-balanced model parallelism framework to adaptively allocate the weights to devices. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT with 64 nodes. The source code can be found at https://github.com/hpcaitech/SkyComputing.