LG AI MA PF
Dec 9, 2023

Speed Up Federated Learning in Heterogeneous Environment: A Dynamic Tiering Approach

Seyed Mahmoud Sajjadi Mohammadabadi, Syed Zawad, Feng Yan, Lei Yang

arXiv:2312.05642v1
6.68 citations
h-index: 33
Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the straggler problem in federated learning on resource-constrained devices, offering a domain-specific improvement.

The paper tackles the problem of slow training in federated learning due to device heterogeneity by proposing DTFL, a dynamic tiering approach that offloads parts of the model to the server, reducing training time by up to 40% while maintaining accuracy on large models like ResNet-110 across multiple datasets.

Federated learning (FL) enables collaboratively training a model while keeping the training data decentralized and private. However, one significant impediment to training a model using FL, especially a large model, is the resource constraints of devices with heterogeneous computation and communication capacities as well as varying task sizes. Such heterogeneity causes significant variation in clients' training times, which lengthens the overall training time and wastes resources on faster clients. To tackle these heterogeneity issues, we propose the Dynamic Tiering-based Federated Learning (DTFL) system, where slower clients dynamically offload part of the model to the server to alleviate resource constraints and speed up training. By leveraging the concept of Split Learning, DTFL offloads different portions of the global model to clients in different tiers and enables each client to update the models in parallel via local-loss-based training. This reduces the computation and communication demand on resource-constrained devices and thus mitigates the straggler problem. DTFL introduces a dynamic tier scheduler that uses tier profiling to estimate the expected training time of each client, based on its historical training time, communication speed, and dataset size. The dynamic tier scheduler assigns clients to suitable tiers to minimize the overall training time in each round. We first theoretically prove the convergence properties of DTFL. We then train large models (ResNet-56 and ResNet-110) on popular image datasets (CIFAR-10, CIFAR-100, CINIC-10, and HAM10000) under both IID and non-IID settings. Extensive experimental results show that, compared with state-of-the-art FL methods, DTFL can significantly reduce the training time while maintaining model accuracy.
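
The tier-scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' scheduler: the tier table, the cost model (compute time plus transfer time, both proportional to dataset size), and every name below (`ClientProfile`, `TIERS`, `estimate_time`, `assign_tiers`) are assumptions made for illustration. The profile values are assumed to have been measured in earlier rounds. The heuristic takes the slowest client's best achievable time as the round time, then gives every other client the tier that keeps the most of the model locally while still finishing within that time.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    name: str
    secs_per_sample_full: float  # assumed: measured time per sample on the full model
    bandwidth: float             # assumed: bytes per second to the server
    num_samples: int             # local dataset size


# Hypothetical tier table: tier -> (fraction of model compute kept on the
# client, bytes exchanged with the server per sample). A higher tier keeps
# more of the model on the client and offloads less.
TIERS = {
    1: (0.25, 4000.0),
    2: (0.50, 2000.0),
    3: (1.00, 0.0),
}


def estimate_time(client: ClientProfile, tier: int) -> float:
    """Estimated round time for a client in a tier: compute plus communication."""
    frac, bytes_per_sample = TIERS[tier]
    compute = client.num_samples * client.secs_per_sample_full * frac
    comm = client.num_samples * bytes_per_sample / client.bandwidth
    return compute + comm


def assign_tiers(clients):
    """Return ({client name: tier}, estimated round time).

    The round time is the slowest client's best achievable time. Every
    client then takes the highest tier that still fits within it, so fast
    clients do not offload work they can finish in time themselves.
    """
    best = {c.name: min(estimate_time(c, t) for t in TIERS) for c in clients}
    round_time = max(best.values())
    assignment = {}
    for c in clients:
        feasible = [t for t in TIERS if estimate_time(c, t) <= round_time]
        assignment[c.name] = max(feasible)
    return assignment, round_time


if __name__ == "__main__":
    clients = [
        ClientProfile("fast", 0.001, 1e6, 1000),
        ClientProfile("slow", 0.02, 1e6, 1000),
    ]
    print(assign_tiers(clients))
```

In this toy setup the slow client is placed in the tier with the most offloading, while the fast client keeps the whole model. A real scheduler would also need to account for server-side load and refresh the profiles every round, which this sketch omits.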
