LG | Mar 13, 2025

Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach

Afrar Jahin, Arif Hassan Zidan, Wei Zhang, Yu Bao, Tianming Liu

arXiv:2503.10573v2 | 11.48 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of comprehensive evaluations of mathematical reasoning in LLMs. It offers researchers and developers insights for aligning models with rigorous reasoning demands, though the advance is incremental because it builds on prior evaluation efforts.

The study systematically evaluated mathematical reasoning abilities across eight leading large language models using three benchmark datasets, finding that DeepSeek-R1 performs competitively with o1 and achieves the highest accuracy on the MMLU Formal Logic benchmark, while distilled variants show substantial performance degradation and Gemini 2.0 Flash has the lowest response latency.

With the rapid advancement of Artificial Intelligence (AI), Large Language Models (LLMs) have significantly impacted a wide array of domains, including healthcare, engineering, science, education, and mathematical reasoning. Among these, mathematical reasoning remains a particularly challenging capability, often requiring multi-step logic and abstract generalization. While prior work has explored LLM performance on reasoning tasks, comprehensive evaluations that span both depth and breadth across model families remain limited. In this study, we present a systematic evaluation of mathematical reasoning abilities across eight leading LLMs, including two recent DeepSeek models, using three independent benchmark datasets. Our analyses reveal several key findings: (1) DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark; (2) distilled variants, such as DeepSeek-1.5B, exhibit substantial performance degradation; and (3) Gemini 2.0 Flash achieves the lowest response latency. Beyond quantitative metrics, we explore how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance. These findings provide new insights into the capabilities and limitations of current LLMs in mathematical domains, and offer guidance for the development of future models better aligned with rigorous reasoning demands.
