CL, Feb 17, 2025

Evaluating Step-by-step Reasoning Traces: A Survey

arXiv:2502.12289v3, 51 citations, h-index: 33, EMNLP
Originality: Synthesis-oriented
AI Analysis

This survey addresses fragmented progress in evaluating reasoning traces and is aimed at AI researchers and practitioners. Its contribution is incremental: it synthesizes existing work without introducing new methods or data.

The paper tackles the problem of inconsistent evaluation practices for step-by-step reasoning traces in large language models, proposing a taxonomy with four categories to organize and review existing datasets and methods.

Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different datasets, evaluator implementations, and recent findings, leading to promising directions for future research.
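
As a rough illustration of how the four-category taxonomy could organize evaluator outputs, the sketch below represents the categories as an enum and attaches step-level scores to them. Only the four category names come from the abstract. The `StepScore` record, the [0, 1] score scale, and the minimum-based aggregation in `trace_scores` are hypothetical choices made for this example, not definitions from the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Criterion(Enum):
    """The four top-level categories named in the abstract."""
    FACTUALITY = "factuality"
    VALIDITY = "validity"
    COHERENCE = "coherence"
    UTILITY = "utility"


@dataclass
class StepScore:
    """Hypothetical record: one evaluator's score for one reasoning step."""
    step_index: int
    criterion: Criterion
    score: float  # assumed to lie in [0, 1]; the scale is our choice


def trace_scores(step_scores: list[StepScore]) -> dict[Criterion, float]:
    """Aggregate step-level scores into one score per criterion.

    Taking the minimum is an illustrative assumption (a trace is only as
    good as its weakest step), not a rule stated in the survey.
    """
    result: dict[Criterion, float] = {}
    for s in step_scores:
        prev = result.get(s.criterion)
        result[s.criterion] = s.score if prev is None else min(prev, s.score)
    return result


# Made-up scores for a two-step trace, for demonstration only.
example = [
    StepScore(0, Criterion.VALIDITY, 1.0),
    StepScore(1, Criterion.VALIDITY, 0.25),
    StepScore(0, Criterion.UTILITY, 0.5),
]
print(trace_scores(example))
```

Keeping the criterion explicit on every score makes it possible to compare evaluators that cover different subsets of the taxonomy; criteria with no scores are simply absent from the result.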

