OC LG · Sep 17, 2025

Bellman Optimality of Average-Reward Robust Markov Decision Processes with a Constant Gain

Stanford

arXiv:2509.14203v211.34 citations · h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work addresses a technical gap in robust MDP theory for long-run average criteria, which is relevant to operations research and management applications. The contribution is incremental, as it builds on existing frameworks.

The paper aims to establish dynamic programming foundations for average-reward robust Markov decision processes (MDPs) with a constant gain. It focuses on the existence of solutions to the robust Bellman equation and on their relationship to the optimal average reward and optimal policies, and it provides conditions that ensure a solution exists.

Learning and optimal control under robust Markov decision processes (MDPs) have received increasing attention, yet most existing theory, algorithms, and applications focus on finite-horizon or discounted models. Long-run average-reward formulations, while natural in many operations research and management contexts, remain underexplored. This is primarily because the dynamic programming foundations are technically challenging and only partially understood, with several fundamental questions remaining open. This paper steps toward a general framework for average-reward robust MDPs by analyzing the constant-gain setting. We study the average-reward robust control problem with possible information asymmetries between the controller and an S-rectangular adversary. Our analysis centers on the constant-gain robust Bellman equation, examining both the existence of solutions and their relationship to the optimal average reward. Specifically, we identify when solutions to the robust Bellman equation characterize the optimal average reward and stationary policies, and we provide one-sided weak communication conditions ensuring solutions' existence. These findings expand the dynamic programming theory for average-reward robust MDPs and lay a foundation for robust dynamic decision making under long-run average criteria in operational environments.
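
For orientation, a constant-gain robust Bellman equation for an S-rectangular adversary is commonly written in the form sketched below. The notation is an assumption for illustration and is not taken from the paper: $g$ is a scalar gain shared by all states, $v$ is a relative value (bias) function, $\Delta(A)$ is the set of randomized action choices, and $\mathcal{P}_s$ is the adversary's ambiguity set at state $s$. The paper's exact formulation, in particular the order of the supremum and infimum under information asymmetry, may differ.

```latex
% Sketch under assumed notation (not the paper's own statement).
% "Constant gain" means g does not depend on the state s.
\[
  g + v(s)
  \;=\;
  \sup_{\pi_s \in \Delta(A)} \;
  \inf_{p_s \in \mathcal{P}_s} \;
  \sum_{a \in A} \pi_s(a)
  \Bigl[\, r(s,a) + \sum_{s' \in S} p_{s,a}(s')\, v(s') \Bigr],
  \qquad \text{for all } s \in S .
\]
```

In this form, a solution pair $(g, v)$ is a candidate certificate for the optimal robust average reward $g$, and the abstract's questions are when such a pair exists and when it characterizes optimality.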
