ML LG Jan 20

Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning

Yuchen Jiao, Jiin Woo, Gen Li, Gauri Joshi, Yuejie Chi

arXiv:2601.13642v1 1.7 h-index: 4

Originality: Highly original

AI Analysis

The paper provides theoretical foundations for efficient long-term decision-making in reinforcement learning, improving the sample complexity of average-reward Q-learning and extending the analysis to federated learning.

This work establishes sample complexity guarantees for Q-learning in average-reward Markov decision processes. It shows that a carefully parameterized algorithm achieves improved sample complexity in the single-agent setting, and that in the federated setting with $M$ agents the per-agent sample complexity drops by a factor of $M$ while requiring only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds.

Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
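To give a feel for how the stated bounds scale, the following minimal Python sketch evaluates the leading-order terms from the abstract. It is only an illustration: the function names are hypothetical, and the constants and logarithmic factors hidden by the $\widetilde{O}(\cdot)$ notation are ignored, so the outputs are relative scales rather than actual sample counts.

```python
# Illustrative only: leading-order terms of the bounds quoted in the abstract.
# Constants and log factors hidden by the O-tilde notation are dropped, and
# all function names here are hypothetical, not taken from the paper.

def single_agent_scale(num_states, num_actions, span, eps):
    """Leading term |S||A| * span^3 / eps^3 of the single-agent bound."""
    return num_states * num_actions * span**3 / eps**3


def per_agent_scale(num_states, num_actions, span, eps, num_agents):
    """Leading term of the federated per-agent bound: single-agent term / M."""
    return single_agent_scale(num_states, num_actions, span, eps) / num_agents


def communication_rounds_scale(span, eps):
    """Leading term span / eps of the number of communication rounds."""
    return span / eps


if __name__ == "__main__":
    # Hypothetical problem sizes chosen so the arithmetic is exact in floats.
    S, A, span, eps, M = 10, 5, 2.0, 0.5, 4
    print(single_agent_scale(S, A, span, eps))   # 3200.0
    print(per_agent_scale(S, A, span, eps, M))   # 800.0
    print(communication_rounds_scale(span, eps)) # 4.0
```

With these made-up values, four collaborating agents each need a quarter of the single-agent sample scale, while the communication term depends only on the span norm and the target accuracy, not on $|\mathcal{S}|$, $|\mathcal{A}|$, or $M$.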
