LG | AI | Sep 29, 2025

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

arXiv:2509.24748v2 | h-index: 14 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the robustness of O2O RL in real-world deployment, where data corruption is common. It is an incremental improvement over existing methods, which focus on mitigating the conservatism of offline policies rather than on handling corruption.

The paper tackles the problem of data corruption in offline-to-online reinforcement learning (O2O RL), which degrades performance, and proposes RPEX, a method that achieves state-of-the-art O2O performance across diverse corruption scenarios as demonstrated on D4RL datasets.

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including corruption of states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighting (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed RPEX: Robust Policy EXpansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves state-of-the-art (SOTA) O2O performance across a wide range of data corruption scenarios. Code is available at https://github.com/felix-thu/RPEX.
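The abstract does not spell out how IPW enters the exploration policy, so the following is only a rough sketch of the two ingredients it names, not the authors' implementation (see the linked repository for that). It assumes a policy-expansion-style setup in which one candidate action from the offline policy and one from the online policy are compared by their Q-values and one is sampled through a softmax. It then computes generic inverse-probability weights, meaning the reciprocal of each candidate's selection probability with a clipping cap. The function names, the temperature, the cap `max_weight`, and the toy numbers are all hypothetical.

```python
import numpy as np


def selection_probs(q_values, temperature=1.0):
    """Softmax over candidate Q-values (numerically stabilised)."""
    q = np.asarray(q_values, dtype=float) / temperature
    q = q - q.max()  # subtract the max so exp() cannot overflow
    p = np.exp(q)
    return p / p.sum()


def select_action(candidates, q_values, temperature=1.0, rng=None):
    """Sample one candidate action with probability softmax(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = selection_probs(q_values, temperature)
    idx = int(rng.choice(len(candidates), p=probs))
    return candidates[idx], idx, probs


def inverse_probability_weights(probs, max_weight=10.0):
    """Generic IPW: weight = 1 / selection probability, capped at max_weight.

    The cap is an illustrative assumption, not a detail taken from the paper.
    """
    w = 1.0 / np.clip(probs, 1e-8, None)
    return np.minimum(w, max_weight)


# Toy example: one candidate from each policy, with made-up Q-values.
offline_action = np.array([0.1, -0.2])
online_action = np.array([0.4, 0.3])
q_values = [1.0, 2.0]

action, idx, probs = select_action(
    [offline_action, online_action], q_values, rng=np.random.default_rng(0)
)
weights = inverse_probability_weights(probs)
```

In this toy case the higher-Q candidate is selected more often and therefore receives the smaller inverse-probability weight. How RPEX actually uses such weights to counter heavy-tailedness is not stated in the abstract.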
