Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
This work identifies specific conditions under which CoT harms model performance, helping practitioners avoid misapplying reasoning techniques, though its contribution is incremental: it connects existing findings from human psychology to model evaluation.
The paper investigates tasks where chain-of-thought (CoT) prompting reduces the performance of large language and multimodal models, inspired by findings in cognitive psychology that deliberation can hurt human performance. In three of the six tasks studied, state-of-the-art models show significant accuracy drops with CoT, up to 36.3% in absolute accuracy for OpenAI o1-preview compared to GPT-4o; in the other tasks, the effects of CoT are mixed.
Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. However, it is still an open question under which settings CoT systematically reduces performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, focusing on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT (up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o), while in others, CoT effects are mixed, with positive, neutral, and negative changes. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models. By connecting the literature on human verbal thinking and deliberation with evaluations of CoT, we offer a perspective for understanding the impact of inference-time reasoning.