LG · ML · Feb 11, 2021

Optimization Issues in KL-Constrained Approximate Policy Iteration

arXiv:2102.06234v1 · 17 citations
Originality: Incremental advance
AI Analysis

This work addresses stability and convergence problems in KL-constrained reinforcement learning algorithms such as TRPO and MPO. The advance is incremental but important for practitioners tuning these methods.

The paper investigates the optimization issues of using KL-divergence as a constraint versus a regularizer in approximate policy iteration, showing that the constrained approach can fail to converge and incur linear regret even on simple problems, while regularization improves the optimization landscape.

Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies, arguing that this is easier to implement and tune. In this work, we study this implementation choice in more detail. We compare the use of KL divergence as a constraint vs. as a regularizer, and point out several optimization issues with the widely-used constrained approach. We show that the constrained algorithm is not guaranteed to converge even on simple problem instances where the constrained problem can be solved exactly, and in fact incurs linear expected regret. With approximate implementation using softmax policies, we show that regularization can improve the optimization landscape of the original objective. We demonstrate these issues empirically on several bandit and RL environments.
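To make the constraint-versus-regularizer distinction concrete, here is a minimal sketch of one policy update of each kind on a single-state (bandit) problem. It is an illustration under stated assumptions, not the authors' implementation: it assumes the divergence is KL(new || previous), that action values `q_values` are known exactly, and that the previous policy has full support. The names `tilt`, `regularized_update`, `constrained_update`, `eta`, and `eps` are hypothetical. Under these assumptions both updates reweight the previous policy by exp(eta * q); the regularized version fixes `eta` in advance, while the constrained version picks `eta` at every step so that the KL budget `eps` is exactly used up whenever it can be.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions; q is assumed to have full support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def tilt(prev, q_values, eta):
    """Return the policy proportional to prev(a) * exp(eta * q(a))."""
    logits = np.log(prev) + eta * q_values
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def regularized_update(prev, q_values, eta):
    """Closed-form maximizer of <pi, q> - (1/eta) * KL(pi || prev)."""
    return tilt(prev, q_values, eta)


def constrained_update(prev, q_values, eps, iters=100):
    """Maximizer of <pi, q> subject to KL(pi || prev) <= eps.

    KL(tilt(eta) || prev) is non-decreasing in eta, so the largest
    feasible eta can be found by bisection (an illustrative choice).
    """
    lo, hi = 0.0, 1.0
    # Grow the bracket until the constraint is violated, or give up at a
    # cap (the near-greedy policy is then already inside the KL ball).
    while kl(tilt(prev, q_values, hi), prev) < eps and hi < 1e6:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilt(prev, q_values, mid), prev) < eps:
            lo = mid
        else:
            hi = mid
    return tilt(prev, q_values, lo)
```

Calling `constrained_update(np.ones(3) / 3, np.array([1.0, 0.5, 0.0]), eps=0.05)` returns a policy whose KL divergence from the uniform policy sits at the budget, whereas `regularized_update` with a fixed `eta` moves by an amount that depends on the gaps between action values. The sketch only illustrates the two update rules; it does not reproduce the paper's convergence or regret results.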

