Conflict-Aware Client Selection for Multi-Server Federated Learning
This work addresses bandwidth conflicts and training latency in multi-server federated learning systems, offering an incremental improvement for distributed machine learning applications.
The paper tackles resource contention and training failures in multi-server federated learning, which arise from overlapping client coverage and uncoordinated client selection. It proposes a decentralized reinforcement learning approach that reduces inter-server conflicts and improves training efficiency, with experiments reporting significant gains in convergence speed and communication cost.
Federated learning (FL) has emerged as a promising distributed machine learning (ML) paradigm that enables collaborative model training across clients without exposing raw data, thereby preserving user privacy and reducing communication costs. Despite these benefits, traditional single-server FL suffers from high communication latency because a single server must aggregate models from a large number of clients. Multi-server FL distributes this workload across edge servers, but overlapping client coverage and uncoordinated selection often lead to resource contention, causing bandwidth conflicts and training failures. To address these limitations, we propose a decentralized reinforcement learning framework with conflict risk prediction, named RL-CRP, to optimize client selection in multi-server FL systems. Specifically, each server estimates the likelihood of client selection conflicts using a categorical hidden Markov model fitted to its sparse historical client selection sequence. A fairness-aware reward mechanism is then incorporated to promote long-term client participation while minimizing training latency and resource contention. Extensive experiments demonstrate that the proposed RL-CRP framework effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.
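To make the two ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the RL-CRP implementation. It assumes a two-state categorical hidden Markov model (calm vs. contended) whose observations are 0 (no conflict), 1 (conflict), or None for rounds in which the server has no record for the client, which is one possible way to handle a sparse history. The function names, the model parameters, and the linear form of the reward with its weights are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Optional, Sequence


def predict_conflict_risk(
    observations: Sequence[Optional[int]],
    initial: Sequence[float],
    transition: Sequence[Sequence[float]],
    emission: Sequence[Sequence[float]],
    conflict_symbol: int = 1,
) -> float:
    """Forward-filter a categorical HMM and return the probability that
    the next observation equals `conflict_symbol`.

    Assumption: a None entry marks a round with no record, so the
    correction step is skipped and only the state prediction is applied.
    """
    n_states = len(initial)
    belief = list(initial)  # distribution over the hidden state at the first round
    for obs in observations:
        if obs is not None:
            # Correction: weight each state by the likelihood of the observed symbol.
            belief = [belief[s] * emission[s][obs] for s in range(n_states)]
            total = sum(belief)
            if total == 0.0:
                raise ValueError("observation has zero probability under the model")
            belief = [b / total for b in belief]
        # Prediction: propagate the belief one round forward.
        belief = [
            sum(belief[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Predictive probability of a conflict in the upcoming round.
    return sum(belief[s] * emission[s][conflict_symbol] for s in range(n_states))


def fairness_aware_reward(
    latency: float,
    conflict_risk: float,
    participation_count: int,
    round_index: int,
    conflict_weight: float = 1.0,  # hypothetical weight
    fairness_weight: float = 0.5,  # hypothetical weight
) -> float:
    """Hypothetical reward: penalize latency and predicted conflict risk,
    and give a bonus to clients that have participated rarely so far."""
    participation_rate = participation_count / max(round_index, 1)
    fairness_bonus = 1.0 - participation_rate
    return -latency - conflict_weight * conflict_risk + fairness_weight * fairness_bonus


# Hypothetical parameters: state 0 is calm, state 1 is contended.
INITIAL = [0.5, 0.5]
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]
EMISSION = [[0.9, 0.1], [0.3, 0.7]]  # columns: P(no conflict), P(conflict)

if __name__ == "__main__":
    history = [0, None, 1, 1, None, 1]  # sparse record for one client
    risk = predict_conflict_risk(history, INITIAL, TRANSITION, EMISSION)
    reward = fairness_aware_reward(
        latency=2.0, conflict_risk=risk, participation_count=3, round_index=6
    )
    print(f"predicted conflict risk: {risk:.3f}, reward: {reward:.3f}")
```

In this sketch a server would compute the predicted risk per candidate client and pass it into the reward seen by its reinforcement learning agent. In practice the HMM parameters would be learned from each server's own selection history (for example with Baum-Welch) rather than fixed by hand as they are here.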