Online Optimization for Offline Safe Reinforcement Learning
This work addresses learning safe policies from fixed data for applications that require constraint satisfaction. It is an incremental improvement that integrates existing techniques in a new way.
The paper tackles offline safe reinforcement learning by framing it as a minimax objective and combining offline RL with online optimization to learn reward-maximizing policies under a cumulative cost constraint. Empirical results on the DSRL benchmark show that the method reliably enforces safety constraints under stringent cost budgets while achieving high rewards.
We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms. We prove the approximate optimality of this approach when integrated with an approximate offline RL oracle and no-regret online optimization. We also present a practical approximation that can be combined with any offline RL algorithm, eliminating the need for offline policy evaluation. Empirical results on the DSRL benchmark demonstrate that our method reliably enforces safety constraints under stringent cost budgets, while achieving high rewards. The code is available at https://github.com/yassineCh/O3SRL.
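To make the minimax framing concrete, the following is a minimal sketch of a generic Lagrangian primal-dual loop. It is not the authors' implementation. It assumes a toy setting in which each candidate policy is summarized by a known (reward, cost) pair, so the hypothetical `best_response` function stands in for an offline RL oracle and a projected gradient step on the multiplier `lam` stands in for the no-regret online optimizer. The names `primal_dual`, `eta`, and `lam_max` are illustrative assumptions. Unlike the practical variant described in the abstract, this toy reads each policy's cost directly, which plays the role of policy evaluation.

```python
from typing import List, Tuple

# Hypothetical toy setup: a "policy" is just its (expected reward, expected cost).
Policy = Tuple[float, float]


def best_response(policies: List[Policy], lam: float) -> Policy:
    """Stand-in for an offline RL oracle: maximize reward - lam * cost."""
    return max(policies, key=lambda p: p[0] - lam * p[1])


def primal_dual(
    policies: List[Policy],
    budget: float,
    rounds: int = 200,
    eta: float = 0.1,       # assumed step size for the multiplier update
    lam_max: float = 10.0,  # assumed upper bound on the multiplier
) -> Tuple[float, float, float]:
    """Alternate oracle best responses with online gradient steps on lam.

    Returns the average reward and average cost of the policies played
    (i.e. of the uniform mixture over rounds) and the final multiplier.
    """
    lam = 0.0
    played: List[Policy] = []
    for _ in range(rounds):
        # Primal player: best response to the current multiplier.
        pi = best_response(policies, lam)
        played.append(pi)
        # Dual player: raise lam when the cost exceeds the budget,
        # lower it otherwise, then project back onto [0, lam_max].
        lam = min(lam_max, max(0.0, lam + eta * (pi[1] - budget)))
    avg_reward = sum(p[0] for p in played) / len(played)
    avg_cost = sum(p[1] for p in played) / len(played)
    return avg_reward, avg_cost, lam


if __name__ == "__main__":
    candidates = [(10.0, 5.0), (6.0, 1.0), (1.0, 0.0)]
    print(primal_dual(candidates, budget=1.0))
```

In this toy run the multiplier grows while the high-reward, high-cost policy is selected, and once it is large enough the oracle switches to the policy whose cost meets the budget. The averaged mixture is therefore only approximately feasible, with the violation shrinking as the number of rounds grows.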