Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study
This work addresses the need to evaluate the reasoning processes of LLMs more rigorously, which is crucial for AI reliability, although its contribution is incremental: it refines existing evaluation methods.
The authors address the evaluation of logical reasoning in large language models (LLMs) by introducing FineLogic, a fine-grained framework that evaluates models along three dimensions: overall accuracy, stepwise soundness, and representation-level probing. They find that natural language supervision generalizes better to out-of-distribution and long-chain problems, while symbolic supervision improves stepwise soundness.
Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles: one in natural language and three symbolic variants. We find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis indicates that fine-tuning primarily refines the model's step-by-step generation process, rather than improving its ability to converge on an answer early. Together, our framework and analysis provide a more rigorous lens for evaluating and improving logical reasoning in LLMs. The code is available at https://github.com/YujunZhou/FineLogic.
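To make the gap between final-answer accuracy and stepwise soundness concrete, the toy sketch below scores two reasoning chains that reach the same correct answer. It is not FineLogic's implementation: the `Step` type, the `evaluate_chain` function, the rule format, and the three reported scores are all hypothetical names and definitions assumed here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    premises: frozenset  # facts this step claims to rely on
    conclusion: str      # fact this step derives


def evaluate_chain(chain, facts, rules, predicted, gold):
    """Score one reasoning chain (assumed toy metrics, not FineLogic's definitions)."""
    known = set(facts)
    valid = 0
    for step in chain:
        # A step counts as valid only if every premise is already established
        # and some rule licenses exactly this inference.
        licensed = (step.premises, step.conclusion) in rules
        if step.premises <= known and licensed:
            valid += 1
            known.add(step.conclusion)
    return {
        "answer_correct": predicted == gold,
        "all_steps_valid": valid == len(chain),
        "valid_step_fraction": valid / len(chain) if chain else 1.0,
    }


# Toy problem: given A and B, with rules A -> C and (B, C) -> D, is D true?
facts = {"A", "B"}
rules = {
    (frozenset({"A"}), "C"),
    (frozenset({"B", "C"}), "D"),
}

# Chain 1 derives D through licensed steps only.
sound_chain = [Step(frozenset({"A"}), "C"), Step(frozenset({"B", "C"}), "D")]
# Chain 2 jumps from A straight to D, which no rule licenses.
shortcut_chain = [Step(frozenset({"A"}), "D")]

sound = evaluate_chain(sound_chain, facts, rules, predicted=True, gold=True)
shortcut = evaluate_chain(shortcut_chain, facts, rules, predicted=True, gold=True)
print(sound)
print(shortcut)
```

Both chains count as correct under final-answer accuracy, but only the first passes the step-level check. This is the kind of difference an accuracy-only benchmark cannot detect.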