CL, LG | May 21, 2025

Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig

arXiv:2505.15623v16.73 citations | h-index: 3 | IJCNN

Originality: Incremental advance

AI Analysis

This work gives researchers and developers a way to assess mathematical understanding in LLMs with an evaluation metric that looks beyond final-answer accuracy.

The study proposes a novel framework for evaluating mathematical reasoning in large language models and introduces the MAPLE score, which integrates error rates, redundancy, and validity to quantify reasoning misalignment.

Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
