Visuospatial Cognitive Assistant
This work addresses visuospatial reasoning for robotics and AI applications, improving performance through a targeted dataset and fine-tuned models.
The paper tackles video-based spatial cognition for robotics and embodied AI by introducing ViCA-322K, a dataset of 322,003 QA pairs from real-world indoor videos, and ViCA-7B, a fine-tuned model that achieves new state-of-the-art results on all eight VSI-Bench tasks, including a gain of +26.1 on Absolute Distance.
Video-based spatial cognition is vital for robotics and embodied AI but remains challenging for current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for both 3D metadata-grounded queries and complex video-based reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art results on all eight VSI-Bench tasks and outperforms existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B on it to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
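To make the idea of a 3D metadata-grounded query concrete, the following is a minimal sketch of how one absolute-distance QA pair might be derived from per-object 3D annotations. It is an illustration under stated assumptions, not the authors' pipeline: the function name `make_distance_qa`, the question template, the use of object centers in meters, and the example coordinates are all hypothetical.

```python
import math


def make_distance_qa(name_a, center_a, name_b, center_b):
    """Build one QA pair from two annotated object centers.

    Hypothetical helper: assumes each center is an (x, y, z) tuple in meters
    taken from a scene's 3D annotations. The question wording is illustrative
    and is not the template used in ViCA-322K.
    """
    # Euclidean distance between the two object centers.
    dist = math.dist(center_a, center_b)
    question = (
        f"What is the distance between the {name_a} and the {name_b}, in meters?"
    )
    # Round to one decimal place for a compact numeric answer.
    answer = f"{dist:.1f}"
    return {"question": question, "answer": answer}


# Made-up coordinates, chosen only to give an easy-to-check distance.
qa = make_distance_qa("sofa", (0.0, 0.0, 0.0), "table", (3.0, 4.0, 0.0))
print(qa["question"])
print(qa["answer"])
```

A real pipeline would also need to handle object extents, duplicate object categories, and annotation noise, none of which this sketch attempts.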