LG, OC, ML · Jun 12, 2020

Zeroth-order Deterministic Policy Gradient

arXiv:2006.07314v2 · 15 citations
AI Analysis

This paper addresses a bottleneck in reinforcement learning for dynamic problems by enabling model-free learning without critics, though it is an incremental improvement over existing DPG methods.

The paper tackles the reliance of deterministic policy gradient (DPG) methods on critics for model-free learning. It introduces Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients using two-point stochastic evaluations of the $Q$-function. The authors also present finite sample complexity bounds that improve on existing results by up to two orders of magnitude.

Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the $Q$-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the $Q$-function, ZDPG lifts the dependence on critics and restores true model-free policy learning, while enjoying built-in and provable algorithmic stability. Additionally, we present new finite sample complexity bounds for ZDPG, which improve upon existing results by up to two orders of magnitude. Our findings are supported by several numerical experiments, which showcase the effectiveness of ZDPG in a practical setting, and its advantages over both PG and Baseline PG.
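The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every function name and signature below (`sample_horizon`, `rollout_q`, `two_point_action_gradient`, `step`, `reward`, `policy`) is hypothetical. The sketch assumes a discount factor $\gamma \in (0,1)$ and a horizon $T$ with $P(T = t) = (1-\gamma)\gamma^t$, so that $P(T \ge t) = \gamma^t$ and the undiscounted reward sum up to $T$ has the discounted return as its expectation. It also assumes a symmetric two-point difference with a Gaussian perturbation direction, a standard zeroth-order estimator whose exact form may differ from the one analyzed in the paper.

```python
import numpy as np


def sample_horizon(gamma, rng):
    """Draw T >= 0 with P(T = t) = (1 - gamma) * gamma**t (assumed horizon law)."""
    # Generator.geometric returns k >= 1 with P(k) = (1 - p)**(k - 1) * p,
    # so with p = 1 - gamma the shifted value k - 1 has the law above.
    return int(rng.geometric(1.0 - gamma)) - 1


def rollout_q(step, reward, policy, state, action, gamma, rng):
    """Random-horizon rollout estimate of Q(state, action).

    Sums undiscounted rewards over T + 1 steps. Because P(T >= t) = gamma**t,
    the expectation of this sum equals the discounted return.
    `step`, `reward`, and `policy` are hypothetical user-supplied callables.
    """
    horizon = sample_horizon(gamma, rng)
    total = 0.0
    for t in range(horizon + 1):
        total += reward(state, action)
        if t < horizon:
            # Only the first action is free; later actions follow the policy.
            state = step(state, action)
            action = policy(state)
    return total


def two_point_action_gradient(q_estimate, state, action, mu, rng):
    """Symmetric two-point zeroth-order estimate of the gradient of Q in the action.

    Perturbs the action along a random Gaussian direction u with radius mu
    and scales u by the resulting finite difference.
    """
    u = rng.standard_normal(action.shape)
    q_plus = q_estimate(state, action + mu * u)
    q_minus = q_estimate(state, action - mu * u)
    return (q_plus - q_minus) / (2.0 * mu) * u
```

For a parameterized deterministic policy, an action-space estimate like this would still need to be combined with the Jacobian of the policy with respect to its parameters to yield a policy-parameter gradient; that step is omitted here. Note that the perturbation lives in the action space, whose dimension is typically much smaller than that of the policy parameters.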
