LG · Jun 24, 2025

Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model

Shuncheng He, Hongchang Zhang, Jianzhun Shao, Yuhang Jiang, Xiangyang Ji

arXiv:2506.19643v1

Originality: Incremental advance

AI Analysis

This paper addresses a key bottleneck in offline RL, the influence of the batch data on final performance, and offers a data-centric approach for task-agnostic settings. The advance is incremental because it builds on existing model-based frameworks.

The paper tackles the out-of-distribution problem in offline reinforcement learning by proposing UDG, an unsupervised data generation method motivated by a bound on the performance gap between the learned and optimal policies. The authors report that it outperforms supervised data generation on unknown tasks.

Offline reinforcement learning (RL) has recently gained growing interest from RL researchers. However, the performance of offline RL suffers from the out-of-distribution problem, which online RL can correct through environment feedback. Previous offline RL research focuses on restricting the offline algorithm to in-distribution, or even in-sample, action sampling. In contrast, less work has paid attention to the influence of the batch data. In this paper, we first build a theoretical bridge between the batch data and the performance of offline RL algorithms, from the perspective of model-based offline RL optimization. We conclude that, under mild assumptions, the distance between the state-action pair distribution generated by the behavioural policy and the distribution generated by the optimal policy accounts for the performance gap between the policy learned by model-based offline RL and the optimal policy. Second, we show that in task-agnostic settings, a series of policies trained by unsupervised RL can minimize the worst-case regret in the performance gap. Inspired by these theoretical conclusions, UDG (Unsupervised Data Generation) is devised to generate data and select proper data for offline training in task-agnostic settings. Empirical results demonstrate that UDG can outperform supervised data generation on solving unknown tasks.
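
The two ideas in the abstract, a distance between state-action distributions and a data selection step, can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say which distance is used or how data is selected, so the total variation distance, the mean-relabelled-reward selection rule, and all names here (`occupancy`, `total_variation`, `select_batch`, the toy batches) are assumptions made for illustration only.

```python
from collections import Counter

def occupancy(pairs):
    """Empirical state-action distribution from a list of (state, action) pairs."""
    counts = Counter(pairs)
    total = len(pairs)
    return {sa: c / total for sa, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts).

    Assumption: the abstract only says "distance"; TV is one common choice.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(sa, 0.0) - q.get(sa, 0.0)) for sa in support)

def select_batch(batches, reward_fn):
    """Return the name of the batch with the highest mean relabelled reward.

    Assumption: a hypothetical selection rule, applied once the task reward
    is revealed; the abstract does not specify the criterion.
    """
    def mean_reward(name):
        pairs = batches[name]
        return sum(reward_fn(s, a) for s, a in pairs) / len(pairs)
    return max(batches, key=mean_reward)

# Toy 1-D world: states are integers, actions are -1, 0, or +1.
# Each batch stands in for data from one task-agnostic behaviour policy.
batches = {
    "go_left":  [(0, -1), (-1, -1), (-2, -1), (-3, -1)],
    "go_right": [(0, 1), (1, 1), (2, 1), (2, 0)],
    "stay":     [(0, 0), (0, 0), (0, 0), (0, 0)],
}

# Hypothetical task revealed afterwards: reward for reaching state >= 2.
reward_fn = lambda s, a: 1.0 if s >= 2 else 0.0

# Hypothetical optimal-policy data for that task (known here only because
# the example is a toy; in practice it is unavailable).
optimal_pairs = [(0, 1), (1, 1), (2, 1), (3, 1)]

distances = {
    name: total_variation(occupancy(pairs), occupancy(optimal_pairs))
    for name, pairs in batches.items()
}
chosen = select_batch(batches, reward_fn)
```

In this toy setup the batch closest to the optimal policy's distribution is also the one the selection rule picks, which mirrors the abstract's claim that a smaller distribution distance goes with a smaller performance gap. Because the task is unknown when the data is generated, a diverse set of behaviour policies makes it more likely that at least one batch is close.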
