Reinforcement Learning for Task Specifications with Action-Constraints
This work addresses safety constraints in reinforcement learning for applications such as robotics and autonomous systems. Its contribution is incremental, since it builds on existing supervisory control and reward machine concepts.
The paper tackles the problem of learning optimal control policies in reinforcement learning under non-Markovian action and state constraints. It uses supervisory control theory and automata to enforce safety, and demonstrates the method in simulation.
In this paper, we use concepts from the supervisory control theory of discrete event systems to propose a method for learning optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively, safe). We assume that the set of action sequences deemed unsafe and/or safe is given as a finite-state automaton, and we propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. We then present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, using reward machines to handle the state constraints. We illustrate the method with an example that captures the utility of automata-based methods for non-Markovian state and action specifications in reinforcement learning, and we report simulation results in this setting.
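To make the idea of a supervisor that disables actions more concrete, the following is a minimal sketch and not the paper's algorithm. Everything in it is a hypothetical illustration: the corridor MDP, the constraint "never take action R twice in a row", and the names `mdp_step`, `DFA_DELTA`, `enabled_actions`, `train`, and `greedy_rollout`. Q-learning runs over the product of the MDP state and the automaton state, and the supervisor masks any action that would drive the automaton into its unsafe state. Two simplifications are assumed: a one-step lookahead is sufficient here because every safe automaton state keeps at least one safe action, and state constraints (handled with reward machines in the paper) are omitted.

```python
import random

# Hypothetical toy problem (not from the paper): a corridor with positions 0..4.
ACTIONS = ["L", "S", "R"]  # move left, stay, move right
GOAL = 4

def mdp_step(pos, action):
    """Deterministic corridor dynamics: reward 1.0 on reaching GOAL, else a small step cost."""
    move = {"L": -1, "S": 0, "R": 1}[action]
    nxt = min(max(pos + move, 0), GOAL)
    reward = 1.0 if nxt == GOAL else -0.01
    return nxt, reward, nxt == GOAL

# Assumed action-constraint automaton:
#   state 0 = last action was not R, state 1 = last action was R,
#   state 2 = unsafe (two consecutive R actions were taken).
DFA_DELTA = {
    (0, "L"): 0, (0, "S"): 0, (0, "R"): 1,
    (1, "L"): 0, (1, "S"): 0, (1, "R"): 2,
}
UNSAFE = {2}

def enabled_actions(q):
    """Supervisor: keep only actions whose automaton successor is not unsafe."""
    return [a for a in ACTIONS if DFA_DELTA[(q, a)] not in UNSAFE]

def train(episodes=2000, max_steps=100, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Tabular Q-learning on product states (pos, q) with supervisor-masked actions."""
    rng = random.Random(seed)
    Q = {}  # maps ((pos, q), action) to a value; missing entries count as 0.0
    for _ in range(episodes):
        pos, q = 0, 0
        for _ in range(max_steps):
            allowed = enabled_actions(q)
            if rng.random() < eps:
                a = rng.choice(allowed)  # explore among enabled actions only
            else:
                a = max(allowed, key=lambda b: Q.get(((pos, q), b), 0.0))
            nxt, r, done = mdp_step(pos, a)
            nq = DFA_DELTA[(q, a)]
            # The bootstrap target also maximizes over enabled actions only.
            best_next = 0.0 if done else max(
                Q.get(((nxt, nq), b), 0.0) for b in enabled_actions(nq)
            )
            old = Q.get(((pos, q), a), 0.0)
            Q[((pos, q), a)] = old + alpha * (r + gamma * best_next - old)
            pos, q = nxt, nq
            if done:
                break
    return Q

def greedy_rollout(Q, max_steps=50):
    """Follow the greedy masked policy from the start; return the action trace and final position."""
    pos, q, trace = 0, 0, []
    for _ in range(max_steps):
        a = max(enabled_actions(q), key=lambda b: Q.get(((pos, q), b), 0.0))
        trace.append(a)
        pos, _reward, done = mdp_step(pos, a)
        q = DFA_DELTA[(q, a)]
        if done:
            break
    return trace, pos
```

In this sketch, calling `trace, pos = greedy_rollout(train())` yields an action trace that cannot contain two consecutive `R` actions, whether or not the Q-values have converged, because the mask is applied at action selection rather than encoded as a reward penalty. A full supervisor in the sense of supervisory control theory may additionally need to disable actions from which reaching an unsafe automaton state becomes unavoidable. That case does not arise in this toy automaton.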