Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
This work addresses value overestimation in offline RL, a problem relevant to both researchers and practitioners, but it appears incremental because it builds on existing conservative methods.
The paper tackles the challenge of distribution shift in offline reinforcement learning by proposing a framework that balances conservatism and performance, resulting in an algorithm that outperforms strong baselines and state-of-the-art methods on benchmark datasets.
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned policy and the behavior policy, which produces out-of-distribution (OOD) actions whose values tend to be overestimated. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.
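The abstract does not state the MCRE objective, so the following is only a minimal sketch of one plausible reading of "combining the TD error with a behavior cloning term". It assumes a squared one-step TD error plus a weighted squared distance between the policy's action and the dataset action. The function name `regularized_evaluation_loss`, the weight `bc_weight`, and the additive form are all illustrative assumptions, not the paper's formulation.

```python
import numpy as np


def regularized_evaluation_loss(q_sa, reward, q_next, done,
                                policy_action, data_action,
                                gamma=0.99, bc_weight=0.1):
    """Hypothetical critic loss: squared TD error plus a behavior cloning penalty.

    This is an assumed form for illustration, not the objective from the paper.
    q_sa, reward, q_next, done: arrays of shape (batch,)
    policy_action, data_action: arrays of shape (batch, action_dim)
    """
    # One-step TD target; terminal transitions (done == 1) do not bootstrap.
    td_target = reward + gamma * (1.0 - done) * q_next
    td_error = q_sa - td_target

    # Behavior cloning term: squared distance between the action chosen by
    # the learned policy and the action recorded in the dataset.
    bc_term = np.sum((policy_action - data_action) ** 2, axis=-1)

    # bc_weight controls how strongly the evaluation is pulled toward the data
    # (an assumed knob for the conservatism/performance trade-off).
    return float(np.mean(td_error ** 2 + bc_weight * bc_term))


# Toy batch of two transitions; the second one is terminal.
q_sa = np.array([1.0, 2.0])
reward = np.array([0.5, 1.0])
q_next = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
policy_action = np.array([[0.0, 0.0], [1.0, 1.0]])
data_action = np.array([[0.0, 0.0], [1.0, 0.0]])

loss = regularized_evaluation_loss(q_sa, reward, q_next, done,
                                   policy_action, data_action,
                                   gamma=0.5, bc_weight=0.5)
print(loss)  # → 0.75
```

In this sketch, setting `bc_weight` to zero recovers a plain squared TD loss, while larger values penalize policy actions that stray from the dataset. That is one simple way to picture the trade-off between conservatism and performance that the abstract describes.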