LG · Jul 26, 2024

A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

arXiv:2407.18878v3 · 9 citations · h-index: 9
AI Analysis

This work addresses the scalability and iteration-complexity issues that limit average-reward reinforcement learning for researchers and practitioners. It is somewhat incremental, however, since it builds on existing actor-critic methods.

The paper tackles average-reward reinforcement learning with general policy parametrization and achieves a global convergence rate of Õ(1/√T) without requiring knowledge of mixing or hitting times, and with a rate that does not scale with the size of the state space.

This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length) without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, so the result applies even to infinite state spaces.
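The abstract names the Multi-level Monte Carlo (MLMC) idea without spelling it out. Below is a minimal sketch of a generic MLMC estimator for averages over a Markov trajectory, assuming the truncated-geometric construction common in this line of work: draw a level J with P(J = j) = 2^-j, collect 2^J consecutive samples, and return g^0 + 2^J (g^J - g^(J-1)), where g^j is the average of the first 2^j samples; if 2^J exceeds a budget t_max, return g^0 alone. This is an illustration of the technique, not the paper's MLMC-NAC estimator, and the two-state chain and the names `markov_chain_samples`, `mlmc_estimate`, and `t_max` are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def markov_chain_samples(n: int, p_stay: float, rng: random.Random, start: int = 0) -> List[int]:
    """Simulate n steps of a hypothetical two-state Markov chain (states 0 and 1)."""
    states = []
    s = start
    for _ in range(n):
        states.append(s)
        if rng.random() >= p_stay:  # switch state with probability 1 - p_stay
            s = 1 - s
    return states


def mlmc_estimate(
    f: Callable[[int], float],
    sample_trajectory: Callable[[int], List[int]],
    t_max: int,
    rng: random.Random,
) -> Tuple[float, int]:
    """Return (MLMC estimate of the mean of f, number of samples used)."""
    # Draw J on {1, 2, ...} with P(J = j) = 2**-j.
    j = 1
    while rng.random() < 0.5:
        j += 1
    n = 2 ** j
    if n > t_max:
        # Truncation: fall back to the single-sample base level g^0.
        traj = sample_trajectory(1)
        return f(traj[0]), 1
    values = [f(x) for x in sample_trajectory(n)]
    g0 = values[0]                          # g^0: first sample only
    g_j = sum(values) / n                   # g^J: average of all 2^J samples
    g_jm1 = sum(values[: n // 2]) / (n // 2)  # g^(J-1): average of the first half
    # Weighting the correction by 1 / P(J = j) = 2^J makes the levels telescope.
    return g0 + n * (g_j - g_jm1), n


if __name__ == "__main__":
    rng = random.Random(0)
    sampler = lambda n: markov_chain_samples(n, p_stay=0.9, rng=rng)
    runs = [mlmc_estimate(lambda s: float(s), sampler, t_max=64, rng=rng) for _ in range(20000)]
    print("mean estimate:", sum(e for e, _ in runs) / len(runs))
    print("mean samples per call:", sum(c for _, c in runs) / len(runs))
```

Because the corrections telescope, the estimator's expectation equals that of the average over the longest untruncated level (here 2^6 = 64 samples), while the expected number of samples per call is only floor(log2 t_max) + P(J > floor(log2 t_max)), which is 6 + 2^-6 for t_max = 64. In this sketch each call restarts the chain at state 0, so the target is the 64-step average from that start rather than the stationary mean.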

