SY LG OC ML
Feb 25, 2023

On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process

Rahul Misra, Rafał Wisniewski, Carsten Skovmose Kallesøe

arXiv:2302.13152v3 1.2 h-index: 16

Originality: Incremental advance

AI Analysis

This paper addresses safety constraints in reinforcement learning for decision-making systems, but the advance is incremental: it builds on existing counterexamples and Lagrangian methods.

The paper tackles safe reinforcement learning in constrained Markov decision processes with a multichain structure, where Bellman's principle of optimality may fail. It formulates the problem as a zero-sum game and develops an asynchronous value iteration scheme and a modified Q-learning algorithm with error bounds.

We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) where the goal of the decision maker is to reach a target set while avoiding one or more unsafe sets with certain probabilistic guarantees. The underlying Markov chain for any control policy is therefore multichain, since by definition there exist a target set and an unsafe set. The decision maker also has to be optimal (with respect to a cost function) while navigating to the target set. This gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure (as shown by the counterexample due to Haviv). We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same setting and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with the corresponding error bounds.
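To make the Lagrangian idea in the abstract concrete, here is a minimal sketch on a toy problem. It is not the authors' asynchronous scheme or their modified $Q$-learning algorithm: it is a plain synchronous value iteration on the stage cost plus a fixed multiplier times the one-step probability of entering the unsafe set. The four-state MDP, the cost values, and the names `lagrangian_value_iteration` and `unsafe_probability` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy MDP (not from the paper):
# states: 0 = start, 1 = detour, 2 = target (absorbing), 3 = unsafe (absorbing)
# actions: 0 = cautious, 1 = risky
N_STATES, N_ACTIONS, UNSAFE = 4, 2, 3

P = np.zeros((N_ACTIONS, N_STATES, N_STATES))  # P[a, s, s']
P[0, 0, 1] = 1.0                   # cautious: start -> detour
P[1, 0, 2], P[1, 0, 3] = 0.8, 0.2  # risky: start -> target or unsafe
P[:, 1, 2] = 1.0                   # detour -> target under either action
P[:, 2, 2] = 1.0                   # target is absorbing
P[:, 3, 3] = 1.0                   # unsafe set is absorbing

# cost[s, a]; absorbing states incur no further cost
cost = np.array([[2.0, 1.0],
                 [2.0, 2.0],
                 [0.0, 0.0],
                 [0.0, 0.0]])


def lagrangian_value_iteration(lam, n_iter=50):
    """Value iteration on cost + lam * P(enter unsafe set), for a fixed lam >= 0."""
    penalty = np.zeros((N_STATES, N_ACTIONS))
    penalty[:2, :] = P[:, :2, UNSAFE].T  # only non-absorbing states can enter
    stage = cost + lam * penalty
    V = np.zeros(N_STATES)
    for _ in range(n_iter):
        Q = stage + np.einsum("ast,t->sa", P, V)  # Q[s, a]
        V = Q.min(axis=1)
    return V, Q.argmin(axis=1)


def unsafe_probability(policy, start=0, n_steps=50):
    """Probability of being absorbed in the unsafe set under a deterministic policy."""
    P_pi = P[policy, np.arange(N_STATES), :]  # row s is P[policy[s], s, :]
    dist = np.zeros(N_STATES)
    dist[start] = 1.0
    for _ in range(n_steps):
        dist = dist @ P_pi
    return dist[UNSAFE]


if __name__ == "__main__":
    for lam in (0.0, 20.0):
        V, policy = lagrangian_value_iteration(lam)
        print(lam, V[0], policy[0], unsafe_probability(policy))
```

In this toy problem a small multiplier selects the cheap, risky action at the start state and a large one selects the costly, safe detour. A fixed multiplier only ever produces a deterministic policy here, so a safety threshold strictly between the two risk levels is not met with equality by either one; this is the kind of gap that motivates treating the multiplier as the second player in a zero-sum game, as the abstract describes.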
