Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
For RL practitioners, ReMax offers a principled way to obtain exploration as an emergent property of the objective, without handcrafted exploration bonuses.
The paper introduces ReMax, an objective that evaluates a policy by the expected maximum return over multiple samples, derives a policy-gradient formulation for it, and proposes RePPO, a PPO variant that optimizes ReMax and induces stochastic exploration without explicit bonuses. Empirically, RePPO promotes exploration on the MinAtar and Craftax benchmarks.
In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
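The retry intuition can be illustrated with a toy calculation. The sketch below is not the paper's definition of ReMax or its algorithm; it assumes a hypothetical two-armed bandit in which exactly one arm pays return 1 and the other pays 0, with a belief of 0.6 that arm A is the good one. A policy picks arm A with probability $p$, and the toy objective is the expected maximum return over $M$ independent draws, averaged over that belief. The names `remax_toy_objective` and `best_policy` and the belief value are illustrative assumptions.

```python
import numpy as np


def remax_toy_objective(p, M, belief_a_is_good=0.6):
    """Expected max return over M draws in a hypothetical two-armed bandit.

    Assumption (not from the paper): one arm pays 1, the other pays 0,
    and we are uncertain which. The max over M draws equals 1 whenever
    at least one draw picks the good arm.
    """
    # If arm A is the good arm, we succeed unless all M draws pick arm B.
    hit_if_a_good = 1.0 - (1.0 - p) ** M
    # If arm B is the good arm, we succeed unless all M draws pick arm A.
    hit_if_b_good = 1.0 - p ** M
    return (belief_a_is_good * hit_if_a_good
            + (1.0 - belief_a_is_good) * hit_if_b_good)


def best_policy(M, grid_size=1001):
    """Grid-search the probability of arm A that maximizes the toy objective."""
    ps = np.linspace(0.0, 1.0, grid_size)
    values = remax_toy_objective(ps, M)
    i = int(np.argmax(values))
    return float(ps[i]), float(values[i])


if __name__ == "__main__":
    for M in (1, 2, 4):
        p_star, value = best_policy(M)
        print(f"M={M}: best p(arm A)={p_star:.3f}, objective={value:.4f}")
```

In this toy setting, a single draw ($M = 1$) gives the objective $0.4 + 0.2p$, which is maximized by the greedy choice $p = 1$. With two draws the objective becomes $0.4 + 1.2p - p^2$, which peaks at the stochastic policy $p = 0.6$ with value $0.76$, compared with $0.6$ for the greedy policy. Randomization pays off here only because retries are available and the returns are uncertain.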