Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
This work addresses overthinking in large reasoning models, improving both efficiency and accuracy, particularly when query complexity varies widely.
Large reasoning models (LRMs) often overthink simple queries, and existing mitigations suffer from unstable accuracy-efficiency trade-offs. This paper proposes a two-stage framework that combines Hybrid Fine-Tuning with adaptive reinforcement learning using Correctness-Preserving Advantage Shaping (CPAS) and Length-Aware Gradient Regulation (LAGR) to achieve stable adaptive thinking, yielding up to +3.7/+3.6 accuracy points and a 40.6%/43.9% reduction in generated tokens on Qwen2.5-1.5B and 7B.
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often overthink low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing a well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.