AI · LG · Oct 17, 2025

Towards Flash Thinking via Decoupled Advantage Policy Optimization

arXiv:2510.15374v1 · 4 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses inefficient reasoning on tasks that need little computation, though it appears incremental because it builds on existing RL methods.

The paper tackles excessively long responses and overthinking in Large Reasoning Models, which increase inference latency and compute cost, by proposing an RL framework called DEPO that reduces sequence length by 39% while improving overall accuracy.

Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, the resulting models still produce excessively long responses and overthink, which increases inference latency and computational cost, especially on simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, that reduces inefficient reasoning. Our method consists of three core components: (1) an advantage decoupling algorithm that guides the model to reduce inefficient tokens; (2) a difficulty-aware length penalty that lowers the overall length of model responses; (3) an advantage clipping method that prevents bias in policy optimization. In our experiments, with DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO reduces sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
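The abstract gives no formulas, so the following is only a rough sketch of how components (2) and (3) might fit together in a group-based RL setup. It is not DEPO's actual algorithm. Everything in it is an assumption made for illustration: the function name `shaped_advantages`, the penalty weight `alpha`, the use of group accuracy as a difficulty proxy, the group-normalized advantage, and the sign-based clipping rule. The advantage decoupling of component (1) is not modeled.

```python
from statistics import mean, pstdev


def shaped_advantages(correct, lengths, alpha=0.5, eps=1e-6):
    """Hypothetical sketch: advantages for one group of sampled responses.

    correct: list of bools, whether each response is right.
    lengths: list of token counts, one per response.
    alpha:   assumed penalty weight; keep it below 1 so a correct
             response always earns more reward than an incorrect one.
    """
    # Assumed difficulty proxy: higher group accuracy means an easier
    # prompt, so long answers are penalized more strongly.
    accuracy = mean(1.0 if c else 0.0 for c in correct)
    max_len = max(lengths)

    rewards = []
    for is_correct, n_tokens in zip(correct, lengths):
        base = 1.0 if is_correct else 0.0
        # Only correct responses are penalized for length.
        penalty = alpha * accuracy * (n_tokens / max_len) if is_correct else 0.0
        rewards.append(base - penalty)

    # Group-normalized advantage (mean/std over the sampled group).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]

    # Assumed clipping rule: the length penalty may shrink the advantage
    # of a correct response to zero but never flip its sign, and an
    # incorrect response never receives a positive advantage.
    return [
        max(a, 0.0) if is_correct else min(a, 0.0)
        for a, is_correct in zip(advantages, correct)
    ]


if __name__ == "__main__":
    print(shaped_advantages([True, True, False, True], [100, 400, 300, 200]))
```

In this toy version, shorter correct answers receive larger advantages than longer correct ones, and the clipping step keeps a long but correct answer from being pushed down as if it were wrong.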

