LG · ML · May 28, 2018

Dual Policy Iteration

arXiv:1805.10755v2 · 60 citations
Originality: Incremental advance
AI Analysis

This work offers an incremental advance in reinforcement learning for continuous control: it extends existing Approximate Policy Iteration (API) theory and unifies model-free and model-based methods.

The paper analyzes Dual Policy Iteration (DPI), an API strategy that alternates between optimizing a fast reactive policy and a slow non-reactive policy. It provides a convergence analysis, derives an instance of the framework that unifies model-free and model-based RL, and demonstrates the approach on continuous control Markov Decision Processes.

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
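The alternating scheme described in the abstract can be illustrated on a toy problem. The following is a minimal sketch under strong assumptions, not the paper's algorithm: it uses a small tabular MDP with known dynamics and rewards, replaces the neural network with a tabular stochastic policy, replaces tree search or local-model optimal control with exact multi-step lookahead, and replaces supervised learning with a simple mixture step toward the lookahead policy. All names (`make_random_mdp`, `lookahead_policy`, `dual_policy_iteration`, `depth`, `step`) are hypothetical and chosen for illustration only.

```python
import random


def make_random_mdp(n_states, n_actions, seed=0):
    """Hypothetical toy MDP: P[s][a] is a next-state distribution, R[s][a] a reward."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            weights = [rng.random() for _ in range(n_states)]
            total = sum(weights)
            P_s.append([w / total for w in weights])
            R_s.append(rng.random())
        P.append(P_s)
        R.append(R_s)
    return P, R


def q_values(P, R, V, gamma):
    """One-step backup: Q[s][a] = R[s][a] + gamma * sum_s' P[s][a][s'] * V[s']."""
    return [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(R[s]))] for s in range(len(R))]


def evaluate(P, R, pi, gamma, sweeps=300):
    """Iterative policy evaluation of a stochastic policy pi[s][a]."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        Q = q_values(P, R, V, gamma)
        V = [sum(pa * qa for pa, qa in zip(pi[s], Q[s])) for s in range(len(R))]
    return V


def lookahead_policy(P, R, V, gamma, depth):
    """Slow policy: plan `depth` steps ahead, bootstrapping from the value V
    of the current reactive policy, and return the greedy first action."""
    for _ in range(depth - 1):
        V = [max(q_s) for q_s in q_values(P, R, V, gamma)]
    greedy = []
    for q_s in q_values(P, R, V, gamma):
        best = q_s.index(max(q_s))
        greedy.append([1.0 if a == best else 0.0 for a in range(len(q_s))])
    return greedy


def dual_policy_iteration(P, R, gamma=0.9, iters=30, depth=3, step=0.5):
    """Alternate between improving the slow policy and imitating it."""
    n_states, n_actions = len(R), len(R[0])
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]  # reactive policy
    for _ in range(iters):
        # Guidance from the reactive policy: its value seeds the planner.
        V = evaluate(P, R, pi, gamma)
        # Improve the slow, non-reactive policy by multi-step lookahead.
        eta = lookahead_policy(P, R, V, gamma, depth)
        # Supervision from the slow policy: move the reactive policy toward it.
        pi = [[(1 - step) * p + step * e for p, e in zip(pi[s], eta[s])]
              for s in range(n_states)]
    return pi


if __name__ == "__main__":
    P, R = make_random_mdp(n_states=5, n_actions=3, seed=0)
    pi = dual_policy_iteration(P, R)
    print([round(v, 3) for v in evaluate(P, R, pi, gamma=0.9)])
```

In this sketch, `step=1.0` makes the reactive policy copy the lookahead policy outright, while a smaller `step` gives a more conservative update. The continuous control setting studied in the paper would need function approximation and learned models in place of the exact tabular quantities used here.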

