CV CL

May 22, 2025

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang

arXiv:2505.17015v1

30.949 citations

h-index: 14

Originality: Highly original

AI Analysis

The paper addresses a known bottleneck, robust multi-frame spatial reasoning for robotics and other real-world applications, with a novel method.

It tackles the limited spatial understanding of multi-modal large language models by proposing a framework that integrates depth perception, visual correspondence, and dynamic perception for multi-frame reasoning. Built around a dataset of more than 27 million samples, the resulting model achieves significant gains over baselines and proprietary systems.

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
