Not All Tokens Are What You Need In Thinking
This work addresses computational bottlenecks for users of large reasoning models by reducing inference cost and latency, though the contribution is incremental because it builds on existing chain-of-thought methods.
The paper tackles inefficiencies in reasoning models, such as high latency and overthinking, by proposing Conditional Token Selection (CTS), a token-level compression framework for chains of thought. On GPQA, Qwen2.5-14B-Instruct trained with CTS gains 9.1% accuracy while using 13.2% fewer reasoning tokens; a more aggressive setting cuts reasoning tokens by 75.8% at the cost of a 5% accuracy drop.
Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
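The selection step described above (score each CoT token, keep only the most important ones, train on what remains) can be illustrated with a minimal sketch. The abstract does not specify how the conditional importance scores are computed, so the sketch takes them as a precomputed input; the function name `compress_cot`, the `keep_ratio` parameter, and the example scores are all hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence


def compress_cot(tokens: Sequence[str],
                 scores: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the highest-scoring fraction of tokens, preserving their order.

    `scores` stands in for per-token conditional importance; how CTS
    actually derives them is NOT shown here (assumed given).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of tokens to keep (rounded), never fewer than one.
    n_keep = max(1, round(len(tokens) * keep_ratio))

    # Rank positions by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))

    # Restore original order so the compressed CoT stays readable.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


# Hypothetical example: made-up scores for an eight-token reasoning step.
cot = ["First", "we", "add", "2", "and", "3", "giving", "5"]
importance = [0.1, 0.2, 0.9, 0.8, 0.3, 0.85, 0.4, 0.95]
compressed = compress_cot(cot, importance, keep_ratio=0.5)
print(compressed)
```

Exposing the kept fraction as a parameter mirrors the "flexible and variable compression ratio" mentioned in the abstract: the same scores can produce training data at several compression levels.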