LG AI · Sep 26, 2025

Quantile Advantage Estimation for Entropy-Safe Reasoning

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

arXiv:2509.22611v1 · 22.69 citations · h-index: 24

Originality: Incremental advance

AI Analysis

This paper addresses training-stability issues in RL for LLM reasoning with a new baseline design that could help scale RLVR, though the advance appears incremental because it modifies existing methods.

The paper tackles entropy collapse and entropy explosion in reinforcement learning for LLM reasoning by proposing Quantile Advantage Estimation, which stabilizes entropy and yields pass@1 gains on benchmarks such as AIME 2024/2025 and AMC 2023.

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
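The two-regime gate described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes binary (0/1) verifiable rewards, an unnormalized advantage `reward - baseline`, and a lower empirical quantile as the baseline. The function names `quantile_baseline`, `qae_advantages`, and `mean_advantages` are hypothetical and chosen only for illustration.

```python
import math


def quantile_baseline(rewards, k):
    """Lower empirical k-quantile of a group of rewards (illustrative assumption):
    the smallest reward r such that at least a fraction k of the group is <= r."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    ordered = sorted(rewards)
    idx = max(math.ceil(k * len(ordered)) - 1, 0)
    return ordered[idx]


def qae_advantages(rewards, k):
    """Advantage of each response against the group-wise k-quantile baseline."""
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]


def mean_advantages(rewards):
    """Advantage against the group mean, shown only for contrast."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


k = 0.4
hard = [0, 0, 0, 0, 0, 0, 1, 1]  # success rate p = 0.25 <= 1 - k
easy = [1, 1, 1, 1, 1, 1, 0, 0]  # success rate p = 0.75 >  1 - k

print(qae_advantages(hard, k))   # [0, 0, 0, 0, 0, 0, 1, 1]
print(qae_advantages(easy, k))   # [0, 0, 0, 0, 0, 0, -1, -1]
print(mean_advantages(hard))     # [-0.25, -0.25, -0.25, -0.25, -0.25, -0.25, 0.75, 0.75]
```

Under these assumptions, the hard group has baseline 0, so only the rare successes receive a nonzero (positive) advantage. The easy group has baseline 1, so only the remaining failures receive a nonzero (negative) advantage. The mean baseline, by contrast, assigns a nonzero advantage to every response in a mixed group.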
