Augmenting Offline Reinforcement Learning with State-only Interactions
This work addresses a practical challenge in reinforcement learning for domains such as simulators and cyber-physical systems, where rewards are hard to obtain, and offers an incremental improvement over existing methods.
The paper tackles reinforcement learning when online interaction yields only state observations, without reward feedback. It augments the offline data with high-return trajectories generated via conditional diffusion models and stitches them with the original offline trajectories. The result is superior empirical performance over state-of-the-art data augmentation methods adapted for this setting.
Batch offline data have been shown to be considerably beneficial for reinforcement learning, and their benefit is further amplified by upsampling with generative models. In this paper, we consider a novel opportunity where interaction with the environment is feasible but restricted to observations, i.e., \textit{no reward} feedback is available. This setting is broadly applicable, as simulators or even real cyber-physical systems are often accessible, whereas reward is often difficult or expensive to obtain. As a result, the learner must make good use of the offline data to devise an efficient scheme for querying state transitions. Our method first leverages online interactions to generate high-return trajectories via conditional diffusion models. These trajectories are then blended with the original offline trajectories through a stitching algorithm, and the resulting augmented data can be applied generically to downstream reinforcement learners. Superior empirical performance is demonstrated over state-of-the-art data augmentation methods that are extended to utilize state-only interactions.
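To make the idea of blending generated and offline trajectories concrete, the following toy sketch joins two state sequences at their closest pair of states. It is only an illustration under stated assumptions and not the paper's stitching algorithm, whose rule is not given in the abstract: the function names `distance` and `stitch`, the Euclidean metric, and the tolerance `tol` are all hypothetical choices, and the diffusion model and reward handling are omitted.

```python
import math


def distance(a, b):
    """Euclidean distance between two states given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stitch(generated, offline, tol=0.1):
    """Join a prefix of `generated` to a suffix of `offline`.

    Hypothetical rule (an assumption, not the paper's method): find the
    closest pair of states (one from each trajectory) within `tol`, keep
    the generated states up to that point, and continue with the offline
    states that follow the matched offline state.
    Returns None when no pair of states is within `tol`.
    """
    best = None  # (distance, index in generated, index in offline)
    for i, g in enumerate(generated):
        for j, o in enumerate(offline):
            d = distance(g, o)
            if d <= tol and (best is None or d < best[0]):
                best = (d, i, j)
    if best is None:
        return None
    _, i, j = best
    return generated[: i + 1] + offline[j + 1:]


# Toy 2-D states: the generated trajectory ends near the start of the offline one.
generated = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
offline = [(2.05, 0.0), (3.0, 0.0), (4.0, 0.0)]
stitched = stitch(generated, offline)
```

In this toy case the last generated state lies within the tolerance of the first offline state, so `stitched` contains the three generated states followed by the remaining two offline states. In a state-only setting, the transition across the junction would presumably need to be checked by querying the environment, which this sketch does not model.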