Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning
This work addresses distribution shift in offline RL, a key challenge for deploying RL in real-world applications such as robotics and autonomous systems, but the contribution is incremental because it builds on existing model-based methods.
The paper tackles distribution shift in model-based offline reinforcement learning by analyzing it as model bias and policy shift, then proposes a shifts-aware reward learning method that modifies rewards to refine value learning and policy training. Empirical results show the approach mitigates distribution shift and achieves superior or comparable performance across benchmarks.
Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift (DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we develop a practical implementation that leverages classifier-based techniques to approximate the adjusted reward for effective policy optimization. Empirical results across multiple benchmarks demonstrate that the proposed approach mitigates distribution shift and achieves superior or comparable performance, validating our theoretical insights.
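To make the classifier-based idea concrete, the following is a minimal sketch of one common way a classifier can approximate a reward adjustment. It is not the paper's exact formulation. It assumes the adjustment takes the form of a log density ratio between real and model-generated samples, estimated from the logit of a binary classifier trained on balanced classes. The function names (`fit_logistic`, `log_ratio`, `adjusted_reward`), the weight `alpha`, and the synthetic Gaussian data are all hypothetical and chosen only for illustration. The sketch covers only a model-bias-style correction; a policy-shift term would need a second ratio estimated the same way.

```python
import numpy as np


def fit_logistic(x_real, x_model, steps=2000, lr=0.1):
    """Fit logistic regression separating real (label 1) from model (label 0) samples."""
    x = np.vstack([x_real, x_model])
    y = np.concatenate([np.ones(len(x_real)), np.zeros(len(x_model))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(real | x)
        grad = p - y                             # gradient of the log loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def log_ratio(x, w, b):
    """Classifier logit; with balanced classes it approximates log p_real(x) / p_model(x)."""
    return x @ w + b


def adjusted_reward(r, x, w, b, alpha=1.0):
    """Assumed form: vanilla reward plus a weighted log-ratio correction (illustrative only)."""
    return r + alpha * log_ratio(x, w, b)


# Synthetic stand-ins for transition features (assumption: 2-D Gaussians).
rng = np.random.default_rng(0)
x_real = rng.normal(loc=1.0, scale=1.0, size=(2000, 2))    # "dataset" samples
x_model = rng.normal(loc=-1.0, scale=1.0, size=(2000, 2))  # "model rollout" samples
w, b = fit_logistic(x_real, x_model)
```

Under these assumptions, samples that resemble the real data receive a positive correction and samples that resemble only the model rollouts receive a negative one, so the adjusted reward penalizes regions where the learned model disagrees with the dataset.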