Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
This paper addresses training stability and inference efficiency issues in reasoning LLMs and represents an incremental improvement over GRPO.
The paper tackles training instability and token inefficiency in Group Relative Policy Optimization (GRPO) for reasoning LLMs by proposing Adaptive Group Policy Optimization (AGPO) with an adaptive loss function, achieving more stable training and superior performance with significantly fewer tokens in reasoning steps.
Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find some deficiencies that affect RL stability and inference efficiency, such as zero variance in advantage estimation. Thus, we propose Adaptive Group Policy Optimization (AGPO), which uses a simple but effective method, an adaptive loss function, to mitigate training fluctuation and token inefficiency. Experiments demonstrate that our method achieves more stable training and superior performance with significantly fewer tokens in reasoning steps.
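The zero-variance issue named in the abstract can be illustrated with a minimal sketch of the standard GRPO group-relative advantage, in which each reward is standardized within its sampled group. This is not the AGPO adaptive loss, whose form is not given here. The function name, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumptions for this sketch: population standard deviation, and a
    small eps added to the denominator to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward has zero variance,
# so every advantage is 0 and the group contributes no policy gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When every response in a group is correct (or every one is wrong), the advantages collapse to zero, which is the kind of degenerate case the abstract refers to.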