Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
This work addresses inefficient reasoning in LLMs on tasks such as mathematical problem-solving and offers insights for designing more efficient reasoning pipelines. Its contribution is incremental: it provides a foundational analysis rather than a new method.
The paper evaluates how useful intermediate reasoning steps are for improving answer accuracy in large language models (LLMs). It finds that conditional entropy that decreases over steps correlates with correct answers, whereas flat or increasing entropy often leads to errors. Incorrect reasoning paths also tend to be longer than correct ones.
Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how the utility of that reasoning contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains and a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary), with the context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings provide a foundation for future work on efficient reasoning pipelines that detect and avoid unproductive reasoning early.
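The entropy measurement described above can be illustrated with a minimal sketch. It assumes that, for each reasoning step, an evaluator model has already produced one next-token probability distribution per position of the answer span Y. The function names, the averaging over answer tokens, the least-squares slope, and the tolerance `tol` are all illustrative assumptions, not the paper's exact procedure, and the toy distributions are made up for demonstration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the answer span for one context.

    Averaging over answer positions is an assumption of this sketch.
    """
    return sum(token_entropy(d) for d in dists) / len(dists)


def entropy_trajectory(steps: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Entropy on the answer span after each reasoning step is appended."""
    return [answer_entropy(dists) for dists in steps]


def trend(traj: Sequence[float], tol: float = 0.05) -> str:
    """Label a trajectory by its least-squares slope against the step index.

    `tol` is a hypothetical threshold separating "flat" from a real change.
    """
    n = len(traj)
    if n < 2:
        return "flat"
    mean_x = (n - 1) / 2
    mean_y = sum(traj) / n
    num = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(traj))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope < -tol:
        return "decreasing"
    if slope > tol:
        return "increasing"
    return "flat"


# Toy data: 3 reasoning steps, a 2-token answer span, a 4-token vocabulary.
toy_steps = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # no reasoning yet
    [[0.70, 0.10, 0.10, 0.10], [0.70, 0.10, 0.10, 0.10]],  # after step 1
    [[0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01]],  # after step 2
]
traj = entropy_trajectory(toy_steps)
print(traj, trend(traj))
```

In this toy case the distributions sharpen with each step, so the trajectory is labelled "decreasing", the pattern the abstract associates with correct answers. In a real pipeline the distributions would come from the evaluator model's output over the vocabulary at each answer position, which this sketch does not implement.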