LG | ML | Jan 31, 2024

Behind the Myth of Exploration in Policy Gradients

arXiv:2402.00162v3 | 3 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work revisits the theoretical role of exploration in reinforcement learning, arguing that exploration terms benefit the optimization process itself rather than only serving a need to explore the environment. The contribution is rated as incremental.

The paper analyzes the role of intrinsic exploration terms in policy-gradient algorithms. It shows that these terms smooth the learning objective, eliminating local optima while preserving the global maximum, and that they modify the stochastic gradient estimates so that parameter updates are more likely to reach an optimal policy. These effects are illustrated empirically with entropy bonuses.

In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitations and suggesting directions for future work.
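As a rough illustration of what an entropy bonus does to a policy-gradient learning objective, the sketch below uses a softmax policy on a three-armed bandit. Everything in it is an assumption made for illustration and not taken from the paper: the bandit setting, the reward values, the bonus weight `lam`, the step size, and all function names. It computes the exact regularized objective J(θ) = E[r] + λ·H(π) and its exact gradient, so it shows only the change to the objective, not the stochastic gradient estimates the abstract also discusses.

```python
import math

def softmax(theta):
    """Softmax policy over actions, shifted by the max for numerical stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def objective(theta, rewards, lam):
    """Expected reward plus an entropy bonus weighted by lam (illustrative)."""
    probs = softmax(theta)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + lam * entropy(probs)

def gradient(theta, rewards, lam):
    """Exact gradient of the regularized objective for a softmax policy.

    Each action gets an augmented reward r(a) - lam * log pi(a); the
    gradient is pi(a) times its advantage over the policy average.
    """
    probs = softmax(theta)
    augmented = [r - lam * math.log(p) for p, r in zip(probs, rewards)]
    baseline = sum(p * a for p, a in zip(probs, augmented))
    return [p * (a - baseline) for p, a in zip(probs, augmented)]

def gradient_ascent(theta, rewards, lam, step=0.5, iters=200):
    """Plain gradient ascent on the regularized objective."""
    theta = list(theta)
    for _ in range(iters):
        g = gradient(theta, rewards, lam)
        theta = [t + step * gi for t, gi in zip(theta, g)]
    return theta

if __name__ == "__main__":
    rewards = [0.0, 0.5, 1.0]   # hypothetical arm rewards
    start = [8.0, 0.0, 0.0]     # policy initially concentrated on the worst arm
    for lam in (0.0, 0.1):
        g = gradient(start, rewards, lam)
        norm = math.sqrt(sum(gi * gi for gi in g))
        final = softmax(gradient_ascent(start, rewards, lam))
        print(f"lam={lam}: initial gradient norm={norm:.6f}, final policy={final}")
```

Running the script prints the initial gradient norm and the final policy with and without the bonus, which allows a side-by-side comparison when the policy starts out nearly deterministic on a poor action. Note that with `lam > 0` the maximizer of the regularized objective is no longer exactly the greedy policy, so the weight has to be chosen or annealed with care.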

