LG MLMay 18, 2025

Provably Sample-Efficient Robust Reinforcement Learning with Average Reward

Zachary Roch, Chi Zhang, George Atia, Yue Wang

arXiv:2505.12462v24.12 citationsh-index: 3

Originality Highly original

AI Analysis

This work addresses the lack of finite-sample guarantees for robust RL in long-term decision-making, which is crucial for practical deployment in data-limited scenarios.

The paper tackles the problem of robust reinforcement learning under the average-reward criterion by proposing Robust Halpern Iteration (RHI), which achieves a sample complexity of ̃O(SAH^2/ε^2) to learn an ε-optimal robust policy, representing the tightest known bound.

Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its specification. However, a significant gap exists in understanding the finite-sample complexity of these methods, as most existing work provides only asymptotic guarantees. This limitation hinders their principled understanding and practical deployment, especially in data-limited scenarios. We close this gap by proposing \textbf{Robust Halpern Iteration (RHI)}, a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our approach offers three key advantages over previous methods: (1). Weaker Structural Assumptions: RHI only requires the underlying robust MDP to be communicating, a less restrictive condition than the commonly assumed ergodicity or irreducibility; (2). No Prior Knowledge: Our algorithm operates without requiring any prior knowledge of the robust MDP; (3). State-of-the-Art Sample Complexity: To learn an $ε$-optimal robust policy, RHI achieves a sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{ε^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result represents the tightest known bound. Our work hence provides essential theoretical understanding of sample efficiency of robust average reward RL.

View on arXiv PDF

Similar