Safe Exploration Using Bayesian World Models and Log-Barrier Optimization
This work addresses a safety-critical obstacle to deploying reinforcement learning in real-world applications. It represents a strong, specific gain rather than a broad paradigm shift.
The paper tackles the challenge of keeping reinforcement learning safe during online tasks. It proposes CERL, a method for constrained Markov decision processes (CMDPs) that uses Bayesian world models and log-barrier optimization to keep the policy safe throughout learning. Experiments show that CERL outperforms state-of-the-art methods in both safety and optimality when learning from image observations.
A major challenge in deploying reinforcement learning in online tasks is ensuring that safety is maintained throughout the learning process. In this work, we propose CERL, a new method for solving constrained Markov decision processes (CMDPs) while keeping the policy safe during learning. Our method leverages Bayesian world models and suggests policies that are pessimistic with respect to the model's epistemic uncertainty. This makes CERL robust to model inaccuracies and leads to safe exploration during learning. In our experiments, we demonstrate that CERL outperforms the current state-of-the-art in terms of safety and optimality in solving CMDPs from image observations.
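To make the two ingredients named in the abstract concrete, the following is a minimal sketch of how a log-barrier objective can be combined with a pessimistic cost estimate. It is not CERL's implementation. It assumes that the Bayesian world model can be sampled to give several estimates of a policy's expected return and expected cumulative cost, and it takes the worst-case cost over those samples as the pessimistic estimate. The function name `log_barrier_objective` and the parameters `budget` and `eta` are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean


def log_barrier_objective(reward_samples, cost_samples, budget, eta=0.1):
    """Score a policy from per-model-sample return and cost estimates.

    Assumption: each entry in reward_samples / cost_samples comes from one
    sample of a Bayesian world model (hypothetical setup, not CERL's code).
    """
    # Average the return estimates across model samples.
    expected_return = mean(reward_samples)
    # Pessimism under epistemic uncertainty: use the worst-case cost.
    worst_case_cost = max(cost_samples)
    # Slack is how far the pessimistic cost sits below the safety budget.
    slack = budget - worst_case_cost
    if slack <= 0:
        # The policy may violate the constraint under some plausible model.
        return -math.inf
    # The log term tends to -inf as the slack shrinks to zero, which
    # pushes an optimizer away from the constraint boundary.
    return expected_return + eta * math.log(slack)


if __name__ == "__main__":
    score = log_barrier_objective([1.0, 3.0], [0.5, 1.0], budget=2.0, eta=0.5)
    print(score)
```

With these illustrative numbers the slack is exactly 1, so the barrier term vanishes and the score equals the mean return. Raising any single cost sample to the budget or beyond makes the score negative infinity, which is how disagreement among model samples translates into more cautious policies in this sketch.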