Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks
This work addresses a limitation of LLM reasoning on tasks that require diverse answers. The contribution is incremental, since it builds on existing prompting methods.
The paper examines why large language models (LLMs) perform poorly on multi-solution tasks and attributes this to reasoning overconfidence: the models express undue certainty in incomplete solution sets. It finds that long chain-of-thought prompting mitigates the problem through iterative exploration and self-reflection.
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to \textbf{reasoning overconfidence}: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce \textit{MuSoBench}, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the \textbf{cognitive-rigidity hypothesis}, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
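The abstract does not say how the attention-entropy analysis is computed. As a rough illustration only, the sketch below assumes the common definition: the Shannon entropy of a single attention distribution, averaged over rows. The function names and toy weights are hypothetical and are not taken from the paper. Under this reading, lower entropy means attention is concentrated on fewer positions, which is the kind of narrowing the cognitive-rigidity hypothesis describes.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is a list of non-negative attention weights for a single
    query position. They are renormalised here, so they need not sum to 1.
    This definition is an assumption, not the paper's stated metric.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_attention_entropy(attention_rows):
    """Average entropy over several attention rows (hypothetical helper)."""
    if not attention_rows:
        raise ValueError("need at least one attention row")
    return sum(attention_entropy(row) for row in attention_rows) / len(attention_rows)


# Toy, made-up distributions: one spread evenly, one sharply peaked.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(attention_entropy(uniform))
print(attention_entropy(peaked))
```

A uniform distribution over n positions reaches the maximum entropy log(n), while a distribution that puts all weight on one position has entropy 0, so the peaked row above scores lower than the uniform one.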