LG ML | Jan 31, 2024

Behind the Myth of Exploration in Policy Gradients

Adrien Bolland, Gaspard Lambrechts, Damien Ernst

arXiv:2402.00162v3 | 7.93 citations | h-index: 5

Originality: Incremental advance

AI Analysis

This work targets researchers interested in the theory of exploration in reinforcement learning. It argues that exploration terms help with optimization itself, not only with the need to explore the environment. The contribution is rated as incremental.

The paper analyzes the role of intrinsic exploration terms in policy-gradient algorithms, showing that they smooth the learning objective to eliminate local optima and improve gradient estimates for better policy optimization, with empirical illustrations using entropy bonuses.

In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitations and suggesting directions for future work.
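As a rough illustration of the kind of entropy bonus the abstract mentions, the sketch below adds an entropy term to the expected reward of a softmax policy on a toy two-armed bandit and computes the exact gradient. The bandit setting, the function names (`softmax`, `entropy`, `objective`, `gradient`), and the coefficient `lam` are assumptions made for this example; they are not taken from the paper, whose criteria and experiments are more general.

```python
import math

def softmax(theta):
    """Softmax policy over arms, computed stably by subtracting the max logit."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy of a distribution p (assumes every p_i > 0)."""
    return -sum(pi * math.log(pi) for pi in p)

def objective(theta, rewards, lam):
    """Expected reward plus lam times the policy entropy (hypothetical toy objective)."""
    p = softmax(theta)
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    return expected_reward + lam * entropy(p)

def gradient(theta, rewards, lam):
    """Exact gradient of `objective` with respect to the logits theta."""
    p = softmax(theta)
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    h = entropy(p)
    # d/d theta_k of the expected reward: p_k * (r_k - E[r])
    # d/d theta_k of the entropy:        -p_k * (log p_k + H)
    return [
        pi * (ri - expected_reward) - lam * pi * (math.log(pi) + h)
        for pi, ri in zip(p, rewards)
    ]

# A policy that strongly prefers the worse arm sits on a near-flat region
# of the plain objective; the entropy term enlarges the gradient there.
theta = [8.0, 0.0]
rewards = [0.0, 1.0]
plain = gradient(theta, rewards, lam=0.0)
with_bonus = gradient(theta, rewards, lam=0.1)
print(plain[1], with_bonus[1])
```

In this toy setting, the gradient component for the better arm is larger with the bonus than without it, which is one simple way to see how an entropy term can reshape the objective around a poor policy. It does not reproduce the paper's analysis of local optima or of stochastic gradient estimates.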
