SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
This work addresses the need for explicit 3D understanding in VLMs for robotics and spatial reasoning applications. It is an incremental improvement that integrates priors from existing 3D foundation models without any training.
The paper addresses spatial reasoning in vision-language models for higher-level 3D-aware tasks such as articulating dynamic scene changes and motion planning. Its zero-shot framework, SpatialPIN, performs well on spatial VQA and extends to robotics tasks such as pick and stack and trajectory planning.
Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick and stack and trajectory planning.