LGSYOCMLJun 3, 2024

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

arXiv:2406.01234v111 citations
Originality Highly original
AI Analysis

This solves a computational efficiency and regret guarantee bottleneck for researchers and practitioners in reinforcement learning, representing a significant advancement rather than an incremental improvement.

The paper tackles the problem of learning average-reward Markov Decision Processes (MDPs) by presenting the first tractable algorithm that achieves minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$ without prior knowledge of $\mathrm{sp}(h^*)$.

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes