LG · AI · ML · Jun 11, 2020

Zeroth-Order Supervised Policy Improvement

arXiv:2006.06600v2 · 10 citations
Originality: Incremental advance
AI Analysis

This work addresses sample efficiency in reinforcement learning for continuous control tasks, offering an incremental improvement over existing methods.

The authors address the limited sample efficiency of policy gradient algorithms in reinforcement learning by proposing Zeroth-Order Supervised Policy Improvement (ZOSPI), which combines zeroth-order optimization with supervised learning to exploit the learned value function globally. The method achieves competitive results on continuous control benchmarks with improved sample efficiency.

Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the learned value function only locally, through first-order updates, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently computing the argmax in a continuous action space: it finds the max-valued action within a small number of samples. The policy learning of ZOSPI has two steps. First, it samples actions and evaluates them with a learned value estimator. Then, it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
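The two-step policy update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Q estimator exposed as a plain function `q_fn(state, action)`, a linear policy `W @ state + b`, Gaussian perturbations for the local samples, and uniform draws over a box-shaped action space for the global samples. All function names, parameters, and default values here are hypothetical.

```python
import numpy as np

def best_sampled_action(q_fn, state, policy_action, low, high, rng,
                        n_local=8, n_global=8, sigma=0.1):
    """Step 1 (zeroth-order): sample candidate actions and keep the best by Q.

    Assumption: local candidates are Gaussian perturbations of the current
    policy action; global candidates are uniform over the box [low, high].
    """
    dim = policy_action.shape[0]
    local = policy_action + sigma * rng.standard_normal((n_local, dim))
    local = np.clip(local, low, high)
    global_ = rng.uniform(low, high, size=(n_global, dim))
    # Include the current policy action so the result is never worse under q_fn.
    candidates = np.vstack([policy_action[None, :], local, global_])
    values = np.array([q_fn(state, a) for a in candidates])
    return candidates[np.argmax(values)]

def supervised_step(W, b, state, target_action, lr=0.1):
    """Step 2 (supervised): one gradient step on 0.5 * ||W s + b - target||^2."""
    err = W @ state + b - target_action
    return W - lr * np.outer(err, state), b - lr * err

# Toy usage with a hypothetical Q function peaked at action = 0.5.
rng = np.random.default_rng(0)
toy_q = lambda s, a: -float(np.sum((a - 0.5) ** 2))
state = np.array([1.0, 0.0])
W, b = np.zeros((2, 2)), np.zeros(2)
for _ in range(50):
    current = W @ state + b
    target = best_sampled_action(toy_q, state, current, -1.0, 1.0, rng)
    W, b = supervised_step(W, b, state, target)
```

Only evaluations of `q_fn` are used to choose the target, never its gradient, which is what makes the search zeroth-order. The regression step then moves the policy toward the selected action.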

