CV · MM · May 25, 2025

Can Multimodal Large Language Models Understand Spatial Relations?

arXiv:2505.19015v2 · 21 citations · h-index: 12 · Has Code · ACL
Originality: Synthesis-oriented
AI Analysis

This work targets spatial understanding in MLLMs, which matters for applications in AI and robotics, but its contribution is incremental: it provides a benchmark rather than proposing a new method.

The paper tackles the problem of spatial relation reasoning in multimodal large language models (MLLMs) by introducing SpatialMQA, a human-annotated benchmark based on COCO2017, and finds that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below human-level accuracy of 98.40%.

Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in a benchmark of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.

Code Implementations: 1 repo
