FRAIN to Train: A Fast-and-Reliable Solution for Decentralized Federated Learning
This work addresses challenges in decentralized federated learning for distributed AI systems, offering an incremental improvement over prior methods.
The paper tackles the problem of slow convergence and client drift in asynchronous federated learning by introducing FRAIN, which uses FastSync and spherical linear interpolation to improve stability and robustness, achieving better performance than existing methods under non-IID data, network delays, and malicious nodes.
Federated learning (FL) enables collaborative model training across distributed clients while preserving data locality. Although FedAvg pioneered synchronous rounds for global model averaging, slower devices can delay collective progress. Asynchronous FL (e.g., FedAsync) addresses stragglers by continuously integrating client updates, yet naive implementations risk client drift due to non-IID data and stale contributions. Some Blockchain-based FL approaches (e.g., BRAIN) employ robust weighting or scoring of updates to resist malicious or misaligned proposals. However, performance drops can still persist under severe data heterogeneity or high staleness, and synchronization overhead has emerged as a new concern due to its aggregator-free architectures. We introduce Fast-and-Reliable AI Network, FRAIN, a new asynchronous FL method that mitigates these limitations by incorporating two key ideas. First, our FastSync strategy eliminates the need to replay past model versions, enabling newcomers and infrequent participants to efficiently approximate the global model. Second, we adopt spherical linear interpolation (SLERP) when merging parameters, preserving models' directions and alleviating destructive interference from divergent local training. Experiments with a CNN image-classification model and a Transformer-based language model demonstrate that FRAIN achieves more stable and robust convergence than FedAvg, FedAsync, and BRAIN, especially under harsh environments: non-IID data distributions, networks that experience delays and require frequent re-synchronization, and the presence of malicious nodes.