Retry Policy Gradients in Continuous Action Spaces
For reinforcement learning in continuous control, this work provides a way to promote exploration without explicit entropy bonuses, though its improvement over existing methods such as SAC is incremental.
This work extends ReMax to continuous action spaces by introducing pathwise derivative estimators for retry objectives. It shows that ReMax encourages stochastic exploration by biasing gradients toward higher policy entropy while damping their magnitude, which slows convergence. The proposed algorithm, ReMax Actor-Critic (ReMAC), achieves performance comparable to SAC without entropy regularization.
Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor-critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
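To make the direction and magnitude effects concrete, the following is a minimal sketch under stated assumptions, not the estimator or the algorithm from this work. It assumes a hypothetical one-step problem with a 1-D Gaussian policy a = mu + exp(log_std) * eps and a deterministic reward r(a) = -(a - 1)^2, and it differentiates the best of K reparameterized samples through the selected sample. All names (reward, max_at_k_pathwise_grad, adam_steady_state_step) and all numeric values are illustrative. The last helper relies on the standard Adam update: for a constant gradient g, the bias-corrected step reduces to lr * g / (|g| + eps).

```python
import math
import random


def reward(a, target=1.0):
    # Hypothetical deterministic reward, peaked at a = target.
    return -(a - target) ** 2


def reward_grad(a, target=1.0):
    # Derivative of the reward with respect to the action.
    return -2.0 * (a - target)


def max_at_k_pathwise_grad(mu, log_std, k, n_batches=20000, seed=0):
    """Monte Carlo pathwise gradient of E[max over K samples of r(a)].

    Actions are reparameterized as a = mu + std * eps, so the gradient
    flows through the sample that attains the maximum reward.
    Returns (d/d mu, d/d log_std), averaged over n_batches retry groups.
    """
    rng = random.Random(seed)
    std = math.exp(log_std)
    g_mu = 0.0
    g_log_std = 0.0
    for _ in range(n_batches):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        actions = [mu + std * e for e in eps]
        best = max(range(k), key=lambda i: reward(actions[i]))
        dr_da = reward_grad(actions[best])
        g_mu += dr_da                          # da/dmu = 1
        g_log_std += dr_da * std * eps[best]   # da/dlog_std = std * eps
    return g_mu / n_batches, g_log_std / n_batches


def adam_steady_state_step(g, lr=1e-3, eps=1e-8):
    # Adam step size for a constant gradient g after bias correction:
    # m_hat = g and v_hat = g**2, so the step is lr * g / (|g| + eps).
    return lr * g / (abs(g) + eps)


mu, log_std = 0.0, math.log(0.5)
for k in (1, 8):
    g_mu, g_log_std = max_at_k_pathwise_grad(mu, log_std, k)
    print(f"K={k}: dJ/dmu={g_mu:.3f}, dJ/dlog_std={g_log_std:.3f}")

# Same small gradient, two choices of Adam's stabilization parameter.
for eps in (1e-8, 1e-3):
    print(f"eps={eps:g}: step={adam_steady_state_step(1e-4, eps=eps):.2e}")
```

For K = 1 the two gradients equal 2 and -0.5 in expectation, which is the ordinary expected-reward gradient at mu = 0 and std = 0.5. In this toy setting, K = 8 should give a smaller gradient on mu and a larger, typically positive, gradient on log_std, mirroring the damping and entropy-raising effects described in the abstract. The Adam helper illustrates why the stabilization parameter matters: when eps is far below |g| the step stays close to lr regardless of how much the gradient was damped, whereas an eps comparable to or larger than |g| lets the damping pass through to the update.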