AI, LG · Oct 19, 2025

DAG-Math: Graph-Guided Mathematical Reasoning in LLMs

Yuanhe Zhang, Ilja Kuzborskij, Jason D. Lee, Chenlei Leng, Fanghui Liu

arXiv:2510.19842v111.13 citations · h-index: 17 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need of researchers and developers for better evaluation of LLM reasoning. It is incremental in that it builds on existing CoT methods by adding a structured diagnostic framework.

The authors ask whether LLMs perform genuine mathematical reasoning or merely pattern matching. They propose a framework that models Chain-of-Thought as a rule-based process over directed acyclic graphs and introduce a metric, logical closeness, to assess reasoning fidelity. Their analysis on standard datasets reveals statistically significant differences in reasoning fidelity among LLM families, even when final-answer accuracy is comparable, highlighting the gap between accuracy and rule-consistent derivation.

Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@k metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when PASS@k is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation. Our benchmark and code are available at: https://github.com/YuanheZ/DAG-MATH-Formatted-CoT.
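To make the graph view concrete, here is a minimal Python sketch of a CoT trajectory stored as a DAG, with nodes as intermediate derivation states and edges as rule applications. The abstract does not give the formal definition of logical closeness, so the check below is only an illustrative stand-in and not the paper's metric. It assumes a trajectory is well-formed when every step cites only earlier steps and every non-final step is used by some later step. The names `Step`, `is_valid_dag_order`, `unused_nodes`, and `is_closed` are hypothetical and do not come from the DAG-MATH code base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One node of a toy DAG-style CoT: a derived statement plus the ids
    of the earlier steps (parents) it was derived from."""
    node_id: int
    statement: str
    parents: List[int] = field(default_factory=list)


def is_valid_dag_order(steps: List[Step]) -> bool:
    """True if ids are unique and every step cites only earlier steps.

    Citing only earlier steps means the trajectory is a topological
    order, which rules out cycles.
    """
    seen = set()
    for step in steps:
        if step.node_id in seen:
            return False
        if any(parent not in seen for parent in step.parents):
            return False
        seen.add(step.node_id)
    return True


def unused_nodes(steps: List[Step]) -> List[int]:
    """Ids of non-final steps that no step cites (dangling derivations)."""
    if not steps:
        return []
    cited = {parent for step in steps for parent in step.parents}
    final_id = steps[-1].node_id
    return [s.node_id for s in steps
            if s.node_id != final_id and s.node_id not in cited]


def is_closed(steps: List[Step]) -> bool:
    """Assumed stand-in criterion: valid order and no dangling steps."""
    return is_valid_dag_order(steps) and not unused_nodes(steps)


# Toy problem: solve 2x + 3 = 11.
clean = [
    Step(0, "2x + 3 = 11"),            # given
    Step(1, "2x = 8", parents=[0]),    # subtract 3 from both sides
    Step(2, "x = 4", parents=[1]),     # divide both sides by 2
]

# Same final answer, but step 3 is derived and never used.
messy = [
    Step(0, "2x + 3 = 11"),
    Step(1, "2x = 8", parents=[0]),
    Step(3, "4x = 16", parents=[1]),   # dangling side derivation
    Step(2, "x = 4", parents=[1]),
]

print(is_closed(clean), is_closed(messy), unused_nodes(messy))
```

Both toy trajectories reach the same final answer, so a final-answer metric such as PASS@k would not separate them, while the structural check does. That is the kind of gap between accuracy and rule-consistent derivation that the abstract describes.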

