AI · CL · LG · Feb 11, 2025

When More is Less: Understanding Chain-of-Thought Length in LLMs

arXiv:2502.07266v3 · 175 citations · h-index: 14
Originality: Highly original
AI Analysis

This research addresses how to calibrate the length of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs), which matters for building models that reason both accurately and efficiently.

The authors find that the accuracy of Large Language Models (LLMs) using Chain-of-Thought (CoT) reasoning follows an inverted U-shaped curve in CoT length: performance first improves, then declines as the number of CoT steps grows. The optimal length increases with task difficulty and decreases with model capability, and training on CoTs of optimal length yields significant practical benefits.

Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To understand these dynamics more deeply, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with CoTs of optimal length and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.
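
The inverted U-shape and the two scaling trends can be illustrated with a small toy calculation. The sketch below is not the paper's theoretical model; it is a minimal illustration under assumptions made up for this example: a task of size `difficulty` is split evenly across `n_steps`, each step fails more often the more work it carries (with a penalty that grows faster than linearly, scaled by `capability`), and every step adds a fixed chance `step_noise` of an unrelated slip. All function names, parameters, and default values are hypothetical.

```python
import math


def accuracy(n_steps: int, difficulty: float, capability: float,
             step_noise: float = 0.02) -> float:
    """Toy probability that a chain of n_steps solves the whole task.

    Assumption (illustrative only): the work is split evenly, so each
    step handles difficulty / n_steps units of work.
    """
    work_per_step = difficulty / n_steps
    # Assumed per-step success: drops quickly as a single step gets harder,
    # and a more capable model tolerates more work per step.
    p_step = math.exp(-(work_per_step ** 2) / capability)
    # Every step must succeed and must also avoid the fixed per-step slip.
    return (p_step * (1.0 - step_noise)) ** n_steps


def optimal_length(difficulty: float, capability: float,
                   max_steps: int = 200, step_noise: float = 0.02) -> int:
    """Brute-force search for the chain length with the highest toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability, step_noise),
    )


if __name__ == "__main__":
    for difficulty, capability in [(10, 10), (20, 10), (10, 40)]:
        best = optimal_length(difficulty, capability)
        print(f"difficulty={difficulty}, capability={capability}, "
              f"best length={best}, "
              f"accuracy={accuracy(best, difficulty, capability):.3f}")
```

In this toy setting, too few steps make each step too hard, while too many steps accumulate per-step slips, so accuracy peaks at an intermediate length. The peak moves to longer chains as `difficulty` grows and to shorter chains as `capability` grows, which mirrors the qualitative trends described in the abstract without reproducing the paper's actual derivation.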

