AI

May 22, 2025

SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

Yujie Hou, Ting Zhang, Mei Wang, Xuetao Ma, Hua Huang

arXiv:2505.16646v4

h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the need for more interpretable and fine-grained assessment of LLMs' mathematical reasoning. The advance is incremental, since it builds on existing concerns about how LLMs are evaluated.

The authors evaluate LLMs' mathematical problem solving beyond superficial metrics by introducing SMART, a framework that decomposes the process into four cognitive dimensions. Applied to 21 LLMs, it reveals significant discrepancies in ability across dimensions, which motivates a new metric, the All-Pass Score.

Large Language Models (LLMs) have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Common evaluation methods, which focus on either the final answer or the reasoning process, fail to assess the entire problem-solving procedure. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework, together with its corresponding benchmark, SMART-Bench. SMART decomposes the entire problem-solving process into four distinct cognitive dimensions: Understanding, Reasoning, Arithmetic, and Reflection & Refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings reveal genuine weaknesses in current LLMs and motivate a new metric, the All-Pass Score, to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.
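The abstract does not define the All-Pass Score, so the following is only an illustrative sketch under a stated assumption: that a problem counts as solved only when the model passes the tasks for all four dimensions. The names `DIMENSIONS`, `per_dimension_accuracy`, and `all_pass_score`, and the example data, are hypothetical and not taken from the SMART code or benchmark.

```python
# Illustrative sketch only. ASSUMPTION: the All-Pass Score is the fraction of
# problems on which a model passes every dimension; the paper may define it
# differently. All names and data below are hypothetical.

DIMENSIONS = ("understanding", "reasoning", "arithmetic", "reflection_refinement")


def per_dimension_accuracy(results):
    """Return the pass rate for each dimension, computed independently."""
    if not results:
        raise ValueError("results must contain at least one problem")
    n = len(results)
    return {d: sum(bool(r[d]) for r in results) / n for d in DIMENSIONS}


def all_pass_score(results):
    """Return the fraction of problems passed on all dimensions (assumed definition)."""
    if not results:
        raise ValueError("results must contain at least one problem")
    passed_all = sum(all(r[d] for d in DIMENSIONS) for r in results)
    return passed_all / len(results)


# Made-up outcomes for four problems: one dict of pass/fail flags per problem.
example = [
    {"understanding": True, "reasoning": True, "arithmetic": True, "reflection_refinement": True},
    {"understanding": True, "reasoning": False, "arithmetic": True, "reflection_refinement": True},
    {"understanding": True, "reasoning": True, "arithmetic": False, "reflection_refinement": True},
    {"understanding": True, "reasoning": True, "arithmetic": True, "reflection_refinement": True},
]

print(per_dimension_accuracy(example))
print(all_pass_score(example))
```

In this made-up example every per-dimension pass rate is at least 0.75, yet only half of the problems are passed on all four dimensions. Under the assumed definition, this is the kind of gap between per-dimension results and end-to-end ability that a stricter joint metric would expose.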
