Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
This work addresses a key limitation for robotics and embodied agents by enhancing 3D reasoning in VLMs without extra training, though the contribution appears incremental because it builds on existing VLM backbones.
The paper tackles the problem that vision-language models struggle with 3D tasks because of a modality gap between their 2D training and 3D task requirements. It introduces SandboxVLM, which improves spatial intelligence in zero-shot settings, including an 8.3% gain on SAT Real.
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3\% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
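The abstract does not specify how multi-view voting produces an abstract 3D bounding box, so the following is only an illustrative sketch of the general idea, not the SandboxVLM implementation. It swaps in a generic visual-hull style vote over a voxel grid. Everything in it is an assumption made for illustration: the three orthographic views, the integer voxel grid, and the names `VIEWS`, `vote_box3d`, and `describe`.

```python
from itertools import product

# Hypothetical orthographic views (an assumption, not from the paper):
# each maps a 3D voxel (x, y, z) to 2D image coordinates by dropping one axis.
VIEWS = {
    "front": lambda p: (p[0], p[1]),  # looks along z
    "side": lambda p: (p[2], p[1]),   # looks along x
    "top": lambda p: (p[0], p[2]),    # looks along y
}


def inside(box2d, uv):
    """True if the 2D point uv lies in the inclusive box (u0, v0, u1, v1)."""
    u0, v0, u1, v1 = box2d
    return u0 <= uv[0] <= u1 and v0 <= uv[1] <= v1


def vote_box3d(boxes2d, grid_size, min_votes):
    """Return (min corner, max corner) of voxels with enough votes, or None.

    A voxel receives one vote from each view whose 2D box contains
    the voxel's projection in that view.
    """
    kept = []
    for voxel in product(range(grid_size), repeat=3):
        votes = sum(
            inside(box, VIEWS[name](voxel)) for name, box in boxes2d.items()
        )
        if votes >= min_votes:
            kept.append(voxel)
    if not kept:
        return None
    lo = tuple(min(v[i] for v in kept) for i in range(3))
    hi = tuple(max(v[i] for v in kept) for i in range(3))
    return lo, hi


def describe(name, box3d):
    """Serialize the abstract 3D box as text that could be placed in a prompt."""
    lo, hi = box3d
    return f"{name}: box from {lo} to {hi}"


# Hypothetical per-view 2D detections of one object on an 8x8x8 grid.
detections = {"front": (2, 1, 4, 3), "side": (5, 1, 6, 3), "top": (2, 5, 4, 6)}
print(describe("mug", vote_box3d(detections, grid_size=8, min_votes=3)))
```

Lowering `min_votes` below the number of views lets the vote tolerate one inconsistent detection, which is the usual motivation for voting rather than requiring agreement from every view.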