LG · AI · Jun 7, 2022

Imitating Past Successes can be Very Suboptimal

arXiv:2206.03378v2 · 25 citations · h-index: 166
AI Analysis

This work addresses a theoretical gap in RL methods for practitioners, though it is incremental as it builds on prior imitation learning approaches.

The paper analyzes outcome-conditioned imitation learning in reinforcement learning, showing that existing methods may not improve policies, and proposes a simple modification that guarantees policy improvement under certain assumptions.

Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we formally relate outcome-conditioned imitation learning to reward maximization, drawing a precise relationship between the learned policy and Q-values and explaining the close connections between these methods and prior EM-based policy search methods. This analysis shows that existing outcome-conditioned imitation learning methods do not necessarily improve the policy, but a simple modification results in a method that does guarantee policy improvement, under some assumptions.
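To make the "label, then imitate" recipe concrete, here is a minimal tabular sketch. It is not the paper's implementation and does not include the proposed modification. It assumes that the outcome of a trajectory is its final state and that imitation is plain maximum-likelihood counting. The helper names `relabel` and `fit_conditional_policy` and the toy trajectories are hypothetical.

```python
from collections import Counter, defaultdict


def relabel(trajectories):
    """Label every (state, action) pair with the outcome its trajectory reached.

    Each trajectory is (steps, final_state), where steps is a list of
    (state, action) pairs. Using the final state as the outcome is an
    assumption of this sketch.
    """
    dataset = []
    for steps, final_state in trajectories:
        for state, action in steps:
            dataset.append((state, final_state, action))
    return dataset


def fit_conditional_policy(dataset):
    """Tabular imitation: estimate pi(action | state, outcome) from counts."""
    counts = defaultdict(Counter)
    for state, outcome, action in dataset:
        counts[(state, outcome)][action] += 1

    policy = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[key] = {a: n / total for a, n in action_counts.items()}
    return policy


# Hypothetical toy experience on a three-state chain (states 0, 1, 2).
trajectories = [
    ([(0, "right"), (1, "right")], 2),  # ended in state 2
    ([(0, "right"), (1, "left")], 0),   # ended back in state 0
    ([(0, "left")], 0),                 # ended in state 0
]

policy = fit_conditional_policy(relabel(trajectories))
print(policy[(0, 2)])  # action distribution in state 0 when conditioned on outcome 2
```

The learned policy simply reproduces whatever actions happened to precede each outcome in the data. That is why, as the abstract notes, this kind of imitation does not by itself guarantee improvement over the policy that collected the experience.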

