LG | Jun 2, 2022

When does return-conditioned supervised learning work for offline reinforcement learning?

arXiv:2206.01079v3 | 94 citations | h-index: 23
Originality: Synthesis-oriented
AI Analysis

This work addresses the theoretical and practical limitations of RCSL for offline RL. Its contribution is incremental: it clarifies the assumptions under which RCSL works rather than proposing a new method.

The paper rigorously analyzes return-conditioned supervised learning (RCSL) for offline RL, finding it requires stronger assumptions than dynamic programming methods to achieve optimal policies and demonstrating its limitations through MDP examples and experiments on D4RL datasets.

Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.
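The two-step recipe described in the abstract (learn the action distribution conditioned on state and return, then condition on a high return) can be illustrated with a minimal tabular sketch. This is not the paper's implementation: the function names `fit_rcsl` and `rcsl_policy`, the toy dataset, the use of undiscounted return-to-go as the conditioning variable, and the choice of the most frequent action are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_rcsl(trajectories):
    """Count actions for each (state, return-to-go) pair seen in the data.

    Each trajectory is a list of (state, action, reward) tuples.
    Hypothetical helper for illustration only.
    """
    counts = defaultdict(Counter)
    for traj in trajectories:
        rtg = 0.0
        # Walk backwards so the return-to-go accumulates from the end.
        for state, action, reward in reversed(traj):
            rtg += reward
            counts[(state, rtg)][action] += 1
    return counts


def rcsl_policy(counts, state, target_return):
    """Return the most frequent action for (state, target_return).

    Returns None when that pair never appears in the dataset.
    """
    seen = counts.get((state, target_return))
    if not seen:
        return None
    return seen.most_common(1)[0][0]


# Toy offline dataset (made up): going "left" from s0 leads to reward 1.
toy_data = [
    [("s0", "left", 0.0), ("s1", "stay", 1.0)],
    [("s0", "right", 0.0), ("s2", "stay", 0.0)],
    [("s0", "left", 0.0), ("s1", "stay", 1.0)],
]

model = fit_rcsl(toy_data)
best_action = rcsl_policy(model, "s0", 1.0)  # condition on the high return
```

In this sketch, conditioning on a return that never occurs in the dataset yields no action at all, so the policy is only defined for (state, return) pairs the data actually covers.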

Code Implementations: 1 repo
