The Provable Benefits of Unsupervised Data Sharing for Offline Reinforcement Learning
This work addresses the challenge of leveraging unlabeled data for offline reinforcement learning. It is incremental in that it builds on existing self-supervised methods, but it introduces a novel algorithm with theoretical guarantees.
The paper asks how to conduct self-supervised offline reinforcement learning in a principled way. It investigates the theoretical benefits of using reward-free data in linear Markov Decision Processes (MDPs) and proposes a Provable Data Sharing algorithm (PDS), which improves performance on various offline RL tasks.
Self-supervised methods have become crucial for advancing deep learning by leveraging data itself to reduce the need for expensive annotations. However, how to conduct self-supervised offline reinforcement learning (RL) in a principled way remains unclear. In this paper, we address this issue by investigating the theoretical benefits of utilizing reward-free data in linear Markov Decision Processes (MDPs) within a semi-supervised setting. Further, we propose a novel Provable Data Sharing algorithm (PDS) to utilize such reward-free data for offline RL. PDS applies additional penalties to the reward function learned from labeled data to prevent overestimation, which keeps the algorithm conservative. Our results on various offline RL tasks demonstrate that PDS significantly improves the performance of offline RL algorithms with reward-free data. Overall, our work provides a promising approach to leveraging the benefits of unlabeled data in offline RL while maintaining theoretical guarantees. We believe our findings will contribute to developing more robust self-supervised RL methods.
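To make the idea of penalizing a learned reward concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation of PDS. It assumes a linear reward model over known features, fits it by ridge regression on the labeled data, and then labels reward-free transitions with the point estimate minus an uncertainty penalty of the elliptical form commonly used in linear MDP analyses. The function name `pessimistic_rewards` and the parameters `beta` and `ridge` are hypothetical and chosen only for illustration.

```python
import numpy as np


def pessimistic_rewards(phi_labeled, rewards, phi_unlabeled, beta=1.0, ridge=1.0):
    """Label reward-free features with a penalized (conservative) reward estimate.

    Assumptions (illustrative, not taken from the paper):
      - rewards are approximately linear in the features,
      - the penalty is beta * sqrt(phi^T Lambda^{-1} phi).
    """
    d = phi_labeled.shape[1]
    # Regularized covariance of the labeled features.
    cov = phi_labeled.T @ phi_labeled + ridge * np.eye(d)
    # Ridge regression estimate of the reward parameters.
    theta = np.linalg.solve(cov, phi_labeled.T @ rewards)
    cov_inv = np.linalg.inv(cov)

    # Point estimate of the reward on the reward-free data.
    point = phi_unlabeled @ theta
    # Penalty grows in feature directions the labeled data covers poorly.
    quad = np.einsum("nd,de,ne->n", phi_unlabeled, cov_inv, phi_unlabeled)
    bonus = beta * np.sqrt(np.maximum(quad, 0.0))

    return point - bonus, point, bonus


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi_l = rng.normal(size=(50, 4))          # labeled features
    r = phi_l @ np.array([1.0, -0.5, 0.0, 0.2]) + 0.1 * rng.normal(size=50)
    phi_u = rng.normal(size=(5, 4))           # reward-free features
    pess, point, bonus = pessimistic_rewards(phi_l, r, phi_u, beta=0.5)
    print(pess)
```

The penalized labels never exceed the point estimates, so a downstream offline RL algorithm trained on them sees conservative rewards on the shared data. How large `beta` should be is a design choice this sketch leaves open.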