LG · CL · May 20, 2025

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum

arXiv:2505.14264v2 · 4 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses training inefficiencies in RL methods for enhancing LLM reasoning, offering an incremental improvement for AI researchers and practitioners.

The paper tackles inefficiencies in group relative advantage estimation for RL-based post-training of LLMs. It proposes AAPO, which uses momentum-enhanced advantages to improve training, and reports superior performance on mathematical reasoning benchmarks.

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
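To make the zero-advantage problem concrete, here is a minimal sketch of group relative advantage estimation in the GRPO style: each sampled response's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper. The abstract does not give AAPO's momentum formula, so the sketch covers only the baseline estimate that AAPO aims to improve.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: eps and the population standard deviation
    are illustrative choices, not taken from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields informative, opposite-signed advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward yields all-zero
# advantages, so those samples contribute no policy-gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group is scored identically (all correct or all wrong), the advantages collapse to zero and the group contributes nothing to the update, which is the inefficiency the abstract describes.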

