STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
This work addresses the need for robust benchmarks that assess MLLMs' capabilities in critical domains such as robotics and autonomous systems, though its contribution is incremental: it focuses on evaluation rather than proposing new methods.
The paper tackles the problem of evaluating Multimodal Large Language Models (MLLMs) for precise spatial-temporal understanding in real-world applications like Embodied AI and Autonomous Driving, and finds that state-of-the-art MLLMs struggle, particularly in tasks requiring precise distance estimation and motion analysis.
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leaving their prospects uncertain. To evaluate models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to assess MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.