AI | Feb 12

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Shuo Lu, Jianjie Cheng, Yinuo Xu, Yongcan Yu, Lijun Sheng, Peijie Wang, Siru Jiang, Yongguan Hu, Run Ling, Yihua Shao, Ao Ma, Wei Feng

arXiv:2602.11635v1 | 2.4 | h-index: 9

Originality: Incremental advance

AI Analysis

This work targets a fundamental weakness in current MLLMs and gives researchers and developers a large-scale resource for disentangling perception from reasoning. It is incremental, however, because it builds on existing evaluation and fine-tuning methods.

The paper evaluates mathematical spatial reasoning in multimodal large language models (MLLMs). It finds that most leading models fail to reach even 60% accuracy on tasks where humans achieve over 95%. It presents MathSpatial, a framework comprising a benchmark, a training dataset, and a structured reasoning method. Fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while reducing tokens by 25%.

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, a unified framework for evaluating and improving spatial reasoning in MLLMs. MathSpatial includes three complementary components: (i) MathSpatial-Bench, a benchmark of 2K problems across three categories and eleven subtypes, designed to isolate reasoning difficulty from perceptual noise; (ii) MathSpatial-Corpus, a training dataset of 8K additional problems with verified solutions; and (iii) MathSpatial-SRT, which models reasoning as structured traces composed of three atomic operations: Correlate, Constrain, and Infer. Experiments show that fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while reducing tokens by 25%. MathSpatial provides the first large-scale resource that disentangles perception from reasoning, enabling precise measurement and comprehensive understanding of mathematical spatial reasoning in MLLMs.
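As a rough illustration of what a structured trace built from the three atomic operations might look like, here is a minimal Python sketch. Only the operation names Correlate, Constrain, and Infer come from the abstract. The class names (Op, Step, Trace), the rendered text format, the toy cube problem, and the accuracy helper are all hypothetical and are not taken from the MathSpatial release.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Op(Enum):
    # The three operation names are from the abstract; the enum itself is an assumption.
    CORRELATE = "Correlate"
    CONSTRAIN = "Constrain"
    INFER = "Infer"


@dataclass
class Step:
    op: Op      # which atomic operation this step performs
    text: str   # free-text content of the step (hypothetical format)


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def add(self, op: Op, text: str) -> "Trace":
        # Append one step and return self so calls can be chained.
        self.steps.append(Step(op, text))
        return self

    def render(self) -> str:
        # One line per step, tagged with its operation name.
        return "\n".join(f"[{s.op.value}] {s.text}" for s in self.steps)


def accuracy(predictions: List[str], answers: List[str]) -> float:
    # Exact-match accuracy; the benchmark's real scoring rule may differ.
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy problem (invented for illustration): find the volume of a cube of edge 2.
trace = (
    Trace()
    .add(Op.CORRELATE, "The front view is a square of side 2, one face of the cube.")
    .add(Op.CONSTRAIN, "All edges of a cube are equal, so every edge has length 2.")
    .add(Op.INFER, "Volume = 2 * 2 * 2 = 8.")
)
print(trace.render())
```

Tagging each step with a single operation keeps a trace short and easy to check, which is one plausible way a structured format could use fewer tokens than free-form reasoning. The sketch does not measure that effect.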

