Provably Safe Reinforcement Learning using Entropy Regularizer
This work addresses safety-critical applications in reinforcement learning, such as robotics or autonomous systems, by providing a provably safe learning method, though it builds incrementally on existing optimism-based approaches.
The paper tackles the problem of learning optimal policies in Markov decision processes with safety constraints. It proposes an online reinforcement learning algorithm that uses entropy regularization and satisfies the safety constraints with high probability during learning. The analysis shows that entropy regularization improves the regret bound and reduces episode-to-episode variability compared with the OFU-based algorithm it builds on.
We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that satisfy the safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the principle of optimism in the face of uncertainty (OFU). Building on this first algorithm, we propose our main algorithm, which utilizes entropy regularization. We carry out a finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that including entropy regularization improves the regret and substantially reduces the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
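One way to build intuition for the variability claim is a single-state toy comparison. The sketch below is not the paper's algorithm: it ignores safety constraints and model estimation, and the value estimates `q_episode_k`, `q_episode_k1` and the temperature `tau` are hypothetical numbers chosen for illustration. It only uses the standard fact that maximizing expected value plus `tau` times the policy entropy over the probability simplex yields a softmax policy, and contrasts that with a greedy (argmax) policy when two consecutive value estimates differ slightly.

```python
import numpy as np

def greedy_policy(q):
    """Deterministic policy: all probability mass on the highest-valued action."""
    pi = np.zeros_like(q, dtype=float)
    pi[np.argmax(q)] = 1.0
    return pi

def entropy_regularized_policy(q, tau):
    """Maximizer of <pi, q> + tau * H(pi) over the simplex, i.e. softmax(q / tau)."""
    z = (q - np.max(q)) / tau  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def total_variation(p, r):
    """Total variation distance between two distributions on the same actions."""
    return 0.5 * np.abs(p - r).sum()

# Hypothetical value estimates from two consecutive episodes (assumed numbers):
# the top two actions swap order after a small change in the estimates.
q_episode_k = np.array([0.50, 0.49, 0.10])
q_episode_k1 = np.array([0.49, 0.50, 0.10])

tau = 0.1  # assumed regularization temperature

tv_greedy = total_variation(greedy_policy(q_episode_k),
                            greedy_policy(q_episode_k1))
tv_soft = total_variation(entropy_regularized_policy(q_episode_k, tau),
                          entropy_regularized_policy(q_episode_k1, tau))

print(f"policy change, greedy:              {tv_greedy:.3f}")
print(f"policy change, entropy-regularized: {tv_soft:.3f}")
```

In this toy case the greedy policy switches actions entirely, while the softmax policy moves only slightly. A smaller `tau` brings the regularized policy closer to greedy behavior, so the temperature trades off smoothness against exploitation of the current estimates.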