Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
This work addresses robust decision-making in uncertain environments for reinforcement learning, providing theoretical guarantees and practical policies, though it is incremental in extending existing regret bounds to robust settings.
The paper tackles the problem of robust Markov decision processes with non-reectangular ambiguity sets under average-reward criteria, showing that policies achieving sublinear expected regret are robust-optimal and that robust value can be represented without rectangularity assumptions. It also constructs an epoch-based policy achieving a constant-order transient value to address poor finite-time performance.
We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or robust dynamic programming principle. Under the weak communication assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate finite-time performance of robust optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transients and deriving regret-based lower bounds on transient values. Finally, we construct an epoch-based policy that combines an optimal stationary policy for the worst-case model with an anytime-valid sequential test and an online learning fallback, achieving a constant-order transient value.