CL · Jan 7

SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation

Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu

arXiv:2601.03649v1 · 3 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses inference efficiency for large language models by cutting computational cost without retraining. The advance is incremental because it builds on existing Chain-of-Thought (CoT) methods.

The paper tackles redundant reasoning traces in Chain-of-Thought prompting, which inflate inference cost, by proposing SyncThink, a training-free decoding method that ends reasoning early. On average, SyncThink reaches 62.00% Top-1 accuracy with 656 generated tokens and 28.68 s latency, compared to 61.22%, 2141 tokens, and 92.01 s for full CoT decoding.
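To make the size of the saving concrete, the following short calculation derives the relative reductions from the figures quoted above. The variable and function names are illustrative only, and the percentages are simple arithmetic on the reported averages rather than numbers stated by the authors.

```python
# Averages quoted in the summary above.
full_cot = {"acc": 61.22, "tokens": 2141, "latency_s": 92.01}
syncthink = {"acc": 62.00, "tokens": 656, "latency_s": 28.68}


def reduction_pct(baseline: float, new: float) -> float:
    """Relative reduction of `new` versus `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


token_reduction = reduction_pct(full_cot["tokens"], syncthink["tokens"])
latency_reduction = reduction_pct(full_cot["latency_s"], syncthink["latency_s"])
acc_delta = syncthink["acc"] - full_cot["acc"]

print(f"tokens:   {token_reduction:.1f}% fewer")
print(f"latency:  {latency_reduction:.1f}% lower")
print(f"accuracy: {acc_delta:+.2f} points")
```

In other words, the reported averages correspond to roughly a threefold cut in both generated tokens and latency, with accuracy essentially unchanged.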

Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "</think>", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00% average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22%, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 points of absolute accuracy by preventing over-thinking.
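The abstract does not spell out the exact termination rule, so the sketch below only illustrates the general idea of watching a reasoning-transition signal during decoding. Everything in it is an assumption made for illustration: the signal is taken to be the next-token probability of the end-of-reasoning token (written `</think>` here), the stopping rule is a fixed `threshold`, and `step_fn` and `toy_step` are hypothetical stand-ins for a real model call. It is not the authors' implementation.

```python
from typing import Callable, Dict, List

END_THINK = "</think>"  # assumed spelling of the reasoning-transition token


def decode_with_early_exit(
    step_fn: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
    max_steps: int = 1000,
) -> List[str]:
    """Greedy decoding that closes the reasoning block once the model's own
    probability for END_THINK reaches `threshold` (an assumed rule)."""
    tokens: List[str] = []
    for _ in range(max_steps):
        probs = step_fn(tokens)  # next-token distribution for the current prefix
        if probs.get(END_THINK, 0.0) >= threshold:
            tokens.append(END_THINK)  # force the transition to the answer phase
            break
        # Otherwise continue reasoning with the most likely non-terminating token.
        candidates = {t: p for t, p in probs.items() if t != END_THINK}
        if not candidates:
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


def toy_step(tokens: List[str]) -> Dict[str, float]:
    """Toy model: the end-of-reasoning probability grows with trace length."""
    p_end = min(len(tokens) / 10, 0.9)
    return {END_THINK: p_end, "step": 1.0 - p_end}


trace = decode_with_early_exit(toy_step)
print(len(trace), trace[-1])
```

With the toy model, reasoning stops as soon as the transition probability reaches the threshold instead of running until the model emits the token on its own. A real integration would read the signal from the model's logits at each step.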
