Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
This work addresses robust policy learning under model uncertainty for reinforcement learning practitioners. It is an incremental advance backed by non-asymptotic sample-complexity guarantees.
The paper tackles robust average-reward reinforcement learning under uncertainty sets by analyzing $Q$-learning and actor-critic methods, achieving sample complexities of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for learning $\epsilon$-optimal robust policies.
We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
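To make the contraction idea concrete, the following is a minimal sketch for the contamination uncertainty set only. It rests on several assumptions that are not taken from the abstract: the span (max minus min) is used as a stand-in for the paper's "carefully designed" semi-norm, the contamination radius is called `delta`, and the sample-based update subtracts the value at a hypothetical reference entry `ref` in the style of relative value iteration. The function names `span_seminorm`, `robust_q_operator`, and `robust_q_step` are illustrative. Under these assumptions the worst-case expectation of $V$ is $(1-\delta)\,\mathbb{E}_{p}[V] + \delta \min_s V(s)$, and the operator shrinks the span of a difference by a factor of at most $1-\delta$.

```python
import numpy as np


def span_seminorm(x):
    """Span semi-norm (max minus min); it vanishes on constant functions."""
    x = np.asarray(x, dtype=float)
    return float(x.max() - x.min())


def robust_q_operator(Q, P, R, delta):
    """Robust Bellman operator for a contamination set (gain offset omitted).

    Q, R: arrays of shape (S, A). P: nominal kernel of shape (S, A, S).
    The adversary mixes the nominal kernel with an arbitrary distribution
    using weight delta, so the worst case puts that weight on min V.
    """
    V = Q.max(axis=1)                       # greedy value per state
    return R + (1.0 - delta) * (P @ V) + delta * V.min()


def robust_q_step(Q, s, a, r, s_next, delta, alpha, ref=(0, 0)):
    """One sample-based update (assumed RVI-style form, not the paper's).

    s_next is drawn from the nominal kernel; Q[ref] is subtracted so the
    iterates stay bounded instead of drifting along constant functions.
    """
    V = Q.max(axis=1)
    target = r + (1.0 - delta) * V[s_next] + delta * V.min() - Q[ref]
    Q_new = Q.copy()
    Q_new[s, a] += alpha * (target - Q[s, a])
    return Q_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, delta = 4, 2, 0.2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
    R = rng.uniform(size=(S, A))
    Q1 = rng.normal(size=(S, A))
    Q2 = rng.normal(size=(S, A))

    lhs = span_seminorm(robust_q_operator(Q1, P, R, delta)
                        - robust_q_operator(Q2, P, R, delta))
    rhs = (1.0 - delta) * span_seminorm(Q1 - Q2)
    print("contraction holds:", lhs <= rhs + 1e-12)
```

The contraction check works because the `delta * V.min()` term is the same for every state-action pair and therefore drops out of the span, while the nominal expectation is scaled by `1 - delta`. The TV and Wasserstein sets named in the abstract need a different worst-case computation and are not covered by this sketch.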