AI · May 7

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

arXiv:2605.05826 · 66.6 · h-index: 5
Predicted impact: top 55% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners of RLVR in LLMs, AGPO addresses the problem of reasoning boundary shrinkage, offering a method that maintains exploration capacity while improving accuracy.

AGPO proposes a reinforcement learning method that suppresses incorrect reasoning paths while scaling positive updates based on intra-group variance, achieving state-of-the-art accuracy on five mathematical benchmarks and improving pass@k performance at scale. In a search ads relevance task at JD, it enhanced data annotation quality, leading to substantial gains in downstream models.
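The pass@k metric mentioned above is usually computed with the standard unbiased estimator: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k samples is correct. The listing does not state the paper's exact evaluation protocol, so the following is only a minimal sketch of that common estimator; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: budget.
    Returns 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all problems in a benchmark gives the reported pass@k; large k probes how much of the problem set the model can cover at all, which is the sense in which a reasoning boundary can shrink.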

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO uses a group advantage mechanism that scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively improves data annotation quality, leading to substantial performance gains in downstream student models.
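The abstract does not give AGPO's formulas, so the following is only an illustrative sketch of what an asymmetric, variance-scaled advantage could look like for one group of rollouts with binary verifiable rewards. Everything specific here is an assumption rather than the authors' method: the function name `asymmetric_advantages`, the `neg_weight` and `pos_weight` parameters, and the choice to scale correct rollouts by a GRPO-style standardized term `(1 - p) / std`, where `p` is the fraction of correct rollouts in the group.

```python
from math import sqrt


def asymmetric_advantages(rewards, neg_weight=1.0, pos_weight=0.5, eps=1e-6):
    """Hypothetical per-rollout advantages for one prompt's rollout group.

    rewards: list of 0/1 verifier outcomes, one per rollout.
    Incorrect rollouts get a fixed penalty -neg_weight (assumed form of
    "negative-dominant" when neg_weight > pos_weight). Correct rollouts
    get pos_weight * (1 - p) / std, which is large when correct answers
    are rare in the group and shrinks to 0 when every rollout is correct.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("rewards must be binary (0 or 1)")
    p = sum(rewards) / len(rewards)   # fraction of correct rollouts
    std = sqrt(p * (1.0 - p))         # intra-group std of a 0/1 reward
    pos_adv = pos_weight * (1.0 - p) / (std + eps)
    return [pos_adv if r == 1 else -neg_weight for r in rewards]
```

Under these assumptions, a single correct rollout among four receives a larger positive advantage than each correct rollout in a group where three of four succeed, and a group that is entirely correct contributes no positive update, which mirrors the stated goal of focusing on rare correct paths while damping trivial ones.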

