Efficient Offline Reinforcement Learning: First Imitate, then Improve
This work addresses the computational inefficiency and instability of training in offline reinforcement learning, offering an incremental improvement for researchers and practitioners in the field.
The paper tackles the problem of inefficient and unstable training in offline reinforcement learning by proposing a hybrid approach that pre-trains with supervised imitation learning before applying off-policy reinforcement learning, resulting in substantially improved training time and greater stability on standard benchmarks.
Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to substantially improve the training time of popular off-policy algorithms on standard benchmarks, and also achieve greater stability. Code is available at: https://github.com/AdamJelley/EfficientOfflineRL
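To make the critic pre-training idea concrete, the following is a minimal sketch of a supervised Monte-Carlo value error. It is not taken from the authors' repository. It assumes that the regression target is the discounted return-to-go of each step in a complete (terminated) episode and that the error is a mean squared error. The function names `discounted_returns` and `monte_carlo_value_error` and the default `gamma` are hypothetical choices made for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted return-to-go G_t for every step of one complete episode.

    Assumption: the episode has terminated, so no bootstrapped value is
    added after the final reward.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def monte_carlo_value_error(predicted_values, rewards, gamma=0.99):
    """Mean squared error between critic predictions and Monte-Carlo returns.

    Assumption: a squared-error loss; the targets are fixed by the dataset,
    so this is ordinary supervised regression with no bootstrapping.
    """
    targets = discounted_returns(rewards, gamma)
    squared_errors = [(v - g) ** 2 for v, g in zip(predicted_values, targets)]
    return sum(squared_errors) / len(targets)


if __name__ == "__main__":
    episode_rewards = [1.0, 1.0, 1.0]
    print(discounted_returns(episode_rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
    print(monte_carlo_value_error([1.75, 1.5, 1.0], episode_rewards, gamma=0.5))  # 0.0
```

Because the targets depend only on the logged rewards, they can be computed once for the whole dataset before training. This is one plausible reason such a pre-training step would avoid the instability the abstract attributes to temporal-difference bootstrapping.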