Constrained Exploration in Reinforcement Learning with Optimality Preservation
This work addresses the challenge of learning an optimal policy in RL when exploration is subject to safety or operational constraints. The contribution is incremental in that it builds on existing theories of discrete-event systems.
The paper tackles the problem that reinforcement-learning agents exploring under behavioral constraints may learn only sub-optimal policies. It introduces a method for constrained exploration that preserves optimality, and it establishes a necessary and sufficient condition for this preservation in known deterministic environments.
We consider a class of reinforcement-learning systems in which the agent follows a behavior policy to explore a discrete state-action space and find an optimal policy while adhering to some restriction on its behavior. Such a restriction may prevent the agent from visiting some state-action pairs, possibly leading the agent to find only a sub-optimal policy. To address this problem, we introduce the concept of constrained exploration with optimality preservation, whereby the exploration behavior of the agent is constrained to meet a specification while the optimality of the (original) unconstrained learning process is preserved. We first establish a feedback-control structure that models the dynamics of the unconstrained learning process. We then extend this structure by adding a supervisor to ensure that the behavior of the agent meets the specification, and establish (for a class of reinforcement-learning problems with a known deterministic environment) a necessary and sufficient condition under which optimality is preserved. This work demonstrates the utility and the prospect of studying reinforcement-learning problems in the context of the theories of discrete-event systems, automata and formal languages.
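The idea of a supervisor that restricts exploration can be illustrated with a small sketch. The code below is not the paper's construction: the four-state deterministic environment `DELTA`, the `forbidden` set standing in for the supervisor, and the helper names `q_learning` and `greedy_return` are all hypothetical, and it simply compares the learned greedy return with and without a restriction rather than checking the paper's necessary and sufficient condition. It assumes tabular Q-learning with learning rate 1 (reasonable only because the environment is deterministic) and, as a simplification, takes the maximum in the update over enabled actions only.

```python
import random

# Hypothetical deterministic environment: (state, action) -> (next_state, reward).
DELTA = {
    (0, "a"): (1, 0.0),
    (0, "b"): (2, 0.0),
    (1, "a"): (3, 10.0),
    (1, "b"): (0, 0.0),
    (2, "a"): (3, 1.0),
    (2, "b"): (0, 0.0),
}
TERMINAL = 3
ACTIONS = ("a", "b")
GAMMA = 0.9


def enabled_actions(state, forbidden):
    """Actions the (hypothetical) supervisor leaves enabled in this state."""
    return [a for a in ACTIONS if (state, a) not in forbidden]


def q_learning(forbidden=frozenset(), episodes=500, max_steps=20, seed=0):
    """Tabular Q-learning with uniformly random exploration over enabled actions."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if state == TERMINAL:
                break
            choices = enabled_actions(state, forbidden)
            if not choices:
                break  # the supervisor has disabled every action here
            action = rng.choice(choices)
            nxt, reward = DELTA[(state, action)]
            # Learning rate 1 is an assumption that relies on determinism.
            future = max(
                (q.get((nxt, b), 0.0) for b in enabled_actions(nxt, forbidden)),
                default=0.0,
            ) if nxt != TERMINAL else 0.0
            q[(state, action)] = reward + GAMMA * future
            state = nxt
    return q


def greedy_return(q, forbidden=frozenset(), max_steps=20):
    """Discounted return of the greedy policy from state 0 under the restriction."""
    state, total, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        choices = enabled_actions(state, forbidden)
        if not choices:
            break
        action = max(choices, key=lambda a: q.get((state, a), 0.0))
        state, reward = DELTA[(state, action)]
        total += discount * reward
        discount *= GAMMA
    return total


if __name__ == "__main__":
    for name, spec in [
        ("unconstrained", frozenset()),
        ("forbid (0, b)", frozenset({(0, "b")})),
        ("forbid (1, a)", frozenset({(1, "a")})),
    ]:
        print(name, greedy_return(q_learning(spec), spec))
```

In this toy setting, forbidding the pair `(0, "b")` leaves the optimal path 0 → 1 → 3 intact, so the greedy return matches the unconstrained one (9.0), whereas forbidding `(1, "a")` removes that path and the agent settles for a return of 0.9. The first restriction is an instance of exploration constrained without loss of optimality; the second is the failure mode the abstract describes.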