CL May 21, 2025

ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy

Gengyang Li, Yifeng Gao, Yuming Li, Yunfang Wu

arXiv:2505.15684v2 20.419 citations h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses reasoning efficiency for LLM users, but the advance is incremental: it builds on existing CoT methods and does not modify the model.

The paper tackles the excessive length of reasoning tokens in Chain-of-Thought prompting for large language models, which increases latency and memory usage. It proposes ThinkLess, a training-free method that terminates reasoning early, reducing decoding time and memory consumption while maintaining comparable accuracy.

While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model's natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves comparable accuracy to full-length CoT decoding while greatly reducing decoding time and memory consumption.
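The two steps described in the abstract, early insertion of the terminator token and a post-regulation instruction, can be illustrated at the prompt level. The following is a minimal sketch only, not the authors' implementation. The `<think>` and `</think>` markers, the function names, the default instruction text, and the choice to place the instruction after the terminator are all assumptions made for illustration; the abstract does not specify any of them.

```python
# Hypothetical sketch of prompt-level early termination.
# All names and marker strings below are assumptions, not taken from the paper.

THINK_OPEN = "<think>"    # assumed marker that opens the reasoning block
THINK_CLOSE = "</think>"  # assumed reasoning terminator token

# Assumed post-regulation instruction; the paper's actual wording is unknown.
DEFAULT_FORMAT_INSTRUCTION = "Give only the final answer on one line."


def truncate_reasoning(reasoning_tokens, keep):
    """Keep at most `keep` reasoning tokens; a negative budget keeps none."""
    return reasoning_tokens[:max(keep, 0)]


def build_early_termination_prompt(
    question,
    partial_reasoning="",
    format_instruction=DEFAULT_FORMAT_INSTRUCTION,
):
    """Close the reasoning block early and append a format instruction.

    The returned string would be passed to a decoder so that generation
    resumes after the terminator and goes straight to the answer.
    """
    parts = [question.strip(), THINK_OPEN]
    reasoning = partial_reasoning.strip()
    if reasoning:
        parts.append(reasoning)
    parts.append(THINK_CLOSE)               # terminator inserted early
    parts.append(format_instruction.strip())  # post-regulation step
    return "\n".join(parts)


if __name__ == "__main__":
    tokens = "First add the units then carry the one".split()
    kept = " ".join(truncate_reasoning(tokens, 3))
    print(build_early_termination_prompt("What is 17 + 25?", kept))
```

With `partial_reasoning` left empty, the reasoning block is closed immediately, which corresponds to the most aggressive form of early termination. How many reasoning tokens to keep before the terminator is a tunable choice in this sketch.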
