LG · OC · Jul 17, 2025

Improving DAPO from a Mixed-Policy Perspective

arXiv:2507.12931v3
Originality: Incremental advance
AI Analysis

This work is aimed at reinforcement learning researchers and practitioners and offers incremental improvements to an existing method, the DAPO algorithm.

The paper tackles instability and sample inefficiency in policy gradient methods, particularly in sparse-reward settings. It introduces two modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm from a mixed-policy perspective: a stable guiding policy that supplies off-policy experience, and the reuse of zero-reward samples that dynamic sampling would otherwise discard. Both are intended to improve training stability, convergence speed, and sample efficiency.

This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_\phi$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\text{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.
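
To make the zero-reward issue concrete, the following is a minimal sketch and not the paper's implementation. It assumes binary 0/1 rewards and group-normalized advantages of the form (r - mean) / (std + eps); the function names `group_advantages` and `split_by_reward` and the example prompts are hypothetical. It shows why a group whose samples all score zero carries no learning signal under such normalization, and how those groups could be set aside as a distinct batch rather than dropped. How that batch is then guided by the expert policy is not specified in the abstract and is not shown here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_by_reward(groups):
    """Partition prompt groups by reward pattern (binary rewards assumed).

    informative: mixed rewards, non-zero advantages
    zero_reward: every sample scored 0 (set aside instead of discarded)
    saturated:   every sample scored 1
    """
    informative, zero_reward, saturated = {}, {}, {}
    for prompt, rewards in groups.items():
        if all(r == 0 for r in rewards):
            zero_reward[prompt] = rewards
        elif all(r == 1 for r in rewards):
            saturated[prompt] = rewards
        else:
            informative[prompt] = rewards
    return informative, zero_reward, saturated


# Hypothetical rewards for three prompts, four sampled responses each.
groups = {"p1": [0, 1, 0, 1], "p2": [0, 0, 0, 0], "p3": [1, 1, 1, 1]}
informative, zero_reward, saturated = split_by_reward(groups)

# An all-zero group yields all-zero advantages, hence no gradient signal.
zero_adv = group_advantages(groups["p2"])
mixed_adv = group_advantages(groups["p1"])
```

In this sketch `zero_reward` holds the groups a dynamic sampling filter would typically discard; keeping them in a separate container is the only step taken toward the reuse described above.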

