AI | Sep 23, 2025

MAPO: Mixed Advantage Policy Optimization

arXiv:2509.18849v3 | 9 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in reinforcement learning for foundation models, offering an incremental improvement for researchers and practitioners in this domain.

The paper tackles the advantage reversion and advantage mirror problems in reinforcement learning for foundation models, both of which hinder reasonable advantage allocation across query samples. It proposes Mixed Advantage Policy Optimization (MAPO), which dynamically reweights the advantage function according to trajectory certainty, and supports the method with comparisons against state-of-the-art approaches and with ablation studies.

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved performance on reasoning tasks. Notably, the advantage function serves as the central mechanism in GRPO for ranking trajectory importance. However, existing approaches encounter both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We show that trajectories differ in certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adapting the advantage function to sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
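
The abstract does not give formulas, so the following is only a minimal sketch of how a certainty-weighted mix of two advantage variants might look. It assumes binary (0/1) rewards within a group of sampled trajectories, the usual GRPO z-score advantage, a percent-deviation advantage of the form (r - mean) / mean, and a hypothetical certainty weight 1 - 4p(1 - p), where p is the group's success rate. All function names and the exact weighting are illustrative assumptions, not the paper's confirmed definitions.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # guards against division by zero; an assumption of this sketch


def grpo_advantages(rewards):
    """Standard group-relative advantage: z-score of each reward in its group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def percent_deviation_advantages(rewards):
    """Assumed percent-deviation form: deviation relative to the group mean."""
    mu = fmean(rewards)
    return [(r - mu) / (mu + EPS) for r in rewards]


def certainty_weight(rewards):
    """Hypothetical certainty score for binary rewards.

    p is the group's success rate. The weight is 0 when p = 0.5
    (most uncertain) and 1 when p is 0 or 1 (fully certain).
    """
    p = fmean(rewards)
    return 1.0 - 4.0 * p * (1.0 - p)


def mixed_advantages(rewards):
    """Blend the two variants: more certainty shifts weight to percent deviation."""
    lam = certainty_weight(rewards)
    z_scores = grpo_advantages(rewards)
    deviations = percent_deviation_advantages(rewards)
    return [(1.0 - lam) * z + lam * d for z, d in zip(z_scores, deviations)]


if __name__ == "__main__":
    uncertain_group = [1, 0, 1, 0]   # success rate 0.5: pure z-score
    confident_group = [1, 1, 1, 0]   # success rate 0.75: partial mix
    print(mixed_advantages(uncertain_group))
    print(mixed_advantages(confident_group))
```

Under these assumptions, a group with a 50% success rate falls back to the plain GRPO advantage, while groups that are nearly all correct or all incorrect lean on the percent-deviation term instead.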
