CL | AI | Jun 30, 2025

Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

arXiv:2506.23840v1 | 12 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses efficiency issues in LRMs for users working within constrained token budgets, though the advance is incremental because it builds on existing models.

The paper tackles the overthinking dilemma in Large Reasoning Models (LRMs), where thinking tokens cause verbose responses and reduce efficiency on simple tasks, and proposes Dual Policy Preference Optimization (DuP-PO) to improve token efficiency while maintaining or enhancing performance on math reasoning benchmarks.

Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., "wait", "however"). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) a rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) a fine-grained advantage control technique to dynamically regulate the prediction of target tokens; (3) a policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on a popular LRM, significantly improving its token efficiency during reasoning while achieving performance superior to that of the base model.
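
The idea behind component (1), balanced exposure to responses with and without thinking tokens, can be illustrated with a toy sketch. The abstract does not say how DuP-PO achieves this balance, so everything below is an assumption made for illustration: the logit-masking approach, the helper names (`suppress_thinking`, `balanced_rollouts`, `toy_policy`), and the toy vocabulary are all hypothetical, and only the two example thinking tokens come from the abstract. It is not the authors' implementation.

```python
import math
import random

# The two example thinking tokens named in the abstract.
THINKING_TOKENS = {"wait", "however"}


def suppress_thinking(logits):
    """Return a copy of the logits with thinking tokens masked to -inf.

    Assumption: masking logits is one simple way to get a "no thinking
    tokens" policy; the paper may do this differently.
    """
    return {tok: (-math.inf if tok in THINKING_TOKENS else value)
            for tok, value in logits.items()}


def sample_token(logits, rng):
    """Sample one token from a softmax over the given logits."""
    top = max(logits.values())
    weights = [math.exp(value - top) for value in logits.values()]
    return rng.choices(list(logits), weights=weights)[0]


def balanced_rollouts(next_logits, n_rollouts, length, rng):
    """Sample rollouts, alternating between the original policy and the
    thinking-suppressed policy, so both kinds are represented equally
    when n_rollouts is even."""
    rollouts = []
    for i in range(n_rollouts):
        suppressed = (i % 2 == 1)
        tokens = []
        for _ in range(length):
            logits = next_logits(tokens)
            if suppressed:
                logits = suppress_thinking(logits)
            tokens.append(sample_token(logits, rng))
        rollouts.append((suppressed, tokens))
    return rollouts


def toy_policy(prefix):
    """Hypothetical stand-in for an LRM: fixed logits over a tiny vocabulary."""
    return {"wait": 2.0, "however": 1.0, "so": 1.0, "answer": 0.5}


if __name__ == "__main__":
    for suppressed, tokens in balanced_rollouts(toy_policy, 4, 6, random.Random(0)):
        label = "no-thinking" if suppressed else "original"
        print(f"{label:12s} {' '.join(tokens)}")
```

In this sketch, half of the sampled responses are guaranteed to be free of thinking tokens, which gives a preference-style objective both kinds of response to compare. How DuP-PO scores and weights them (components 2 and 3) is not specified in the abstract and is left out here.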