LGMLJan 29

Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle

arXiv:2601.21301v1h-index: 6
Originality Highly original
AI Analysis

This provides a theoretical foundation for efficient reinforcement learning in average-reward settings, addressing a known bottleneck with practical implications for continuous control and robotics.

The paper tackles the challenge of non-asymptotic convergence for average-reward Q-learning, where contraction is absent, by introducing a lazy transformation and a new seminorm to achieve optimal sample complexity of O(ε^{-2}) up to logarithmic factors for both synchronous and asynchronous variants.

We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes