LG | ML | May 28, 2018

Dual Policy Iteration

Wen Sun, Geoffrey J. Gordon, Byron Boots, J. Andrew Bagnell

arXiv:1805.10755v2 | 19.361 citations | h-index: 56

Originality: Incremental advance

AI Analysis

This work addresses reinforcement learning for continuous control tasks. It offers an incremental advance by extending existing Approximate Policy Iteration (API) theory and by unifying model-free and model-based methods.

The paper analyzes the Dual Policy Iteration (DPI) strategy, which alternates between optimizing a fast reactive policy and a slow non-reactive policy. It provides a theoretical convergence analysis, develops an instance that unifies model-free and model-based RL, and demonstrates the approach on continuous control Markov Decision Processes.

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework that reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
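
The alternating update described in the abstract can be illustrated with a small tabular sketch. This is not the authors' algorithm: it assumes a hypothetical five-state chain MDP with known dynamics (the paper instead uses learned local models on continuous control tasks), uses a short greedy lookahead as a stand-in for the slow non-reactive policy, and uses a simple mixing step as a stand-in for supervised learning of the reactive policy. All function names and parameters below are illustrative.

```python
def make_chain(n_states=5):
    """Hypothetical toy MDP: action 0 steps left, action 1 steps right."""
    last = n_states - 1
    next_state = [[max(s - 1, 0), min(s + 1, last)] for s in range(n_states)]
    # Reward 1 for a right step that lands on (or stays at) the last state.
    reward = [[0.0, 1.0 if min(s + 1, last) == last else 0.0]
              for s in range(n_states)]
    return next_state, reward


def evaluate(pi, next_state, reward, gamma, sweeps=500):
    """Approximate value of the stochastic reactive policy by repeated backups."""
    n = len(pi)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [sum(pi[s][a] * (reward[s][a] + gamma * v[next_state[s][a]])
                 for a in range(2))
             for s in range(n)]
    return v


def lookahead_policy(v, next_state, reward, gamma, horizon):
    """Non-reactive policy: greedy first action of a lookahead seeded with v."""
    n = len(v)
    for _ in range(horizon):  # horizon must be at least 1
        q = [[reward[s][a] + gamma * v[next_state[s][a]] for a in range(2)]
             for s in range(n)]
        v = [max(row) for row in q]
    return [row.index(max(row)) for row in q]


def dual_policy_iteration_sketch(n_iters=20, gamma=0.9, horizon=3, step=0.5):
    next_state, reward = make_chain()
    n = len(reward)
    pi = [[0.5, 0.5] for _ in range(n)]  # reactive policy starts uniform
    for _ in range(n_iters):
        # Guidance from the reactive policy: its value seeds the lookahead.
        v_pi = evaluate(pi, next_state, reward, gamma)
        eta = lookahead_policy(v_pi, next_state, reward, gamma, horizon)
        # Supervision from the non-reactive policy: move pi toward eta's action.
        pi = [[(1.0 - step) * pi[s][a] + step * (1.0 if a == eta[s] else 0.0)
               for a in range(2)]
              for s in range(n)]
    return pi


if __name__ == "__main__":
    for s, row in enumerate(dual_policy_iteration_sketch()):
        print(s, row)
```

In this toy setting the reactive policy drifts toward stepping right in every state, which is the optimal behavior for the chain. The mixing parameter `step` plays the role of a conservative update size; the sketch makes no claim about how the paper chooses it.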
