CV, AI, CL, LG, RO
May 18, 2025

Visuospatial Cognitive Assistant

arXiv:2505.12312v4
5 citations
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the problem of visuospatial reasoning for robotics and AI applications, offering incremental improvements through targeted datasets and models.

The paper tackles the challenge of video-based spatial cognition for robotics and embodied AI with two contributions: ViCA-322K, a dataset of 322,003 QA pairs from real-world indoor videos, and ViCA-7B, a model fine-tuned on that dataset that achieves new state-of-the-art results on all eight VSI-Bench tasks, including a +26.1 gain on Absolute Distance.

Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
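To make the idea of a "3D metadata-grounded query" concrete, here is a minimal sketch of how one absolute-distance QA pair could be derived from object positions in a scene. It is an illustration under stated assumptions, not the authors' pipeline: the `SceneObject` type, the `absolute_distance_qa` helper, the question wording, and the use of object-center distance in metres are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical record for one annotated object in an indoor scene."""
    label: str
    center: tuple[float, float, float]  # assumed: object center in metres


def absolute_distance_qa(a: SceneObject, b: SceneObject) -> dict[str, str]:
    """Build one QA pair whose answer comes from 3D metadata, not from pixels.

    Assumption: distance is measured between object centers; a real dataset
    could define it differently (for example, between closest surface points).
    """
    distance_m = math.dist(a.center, b.center)  # Euclidean distance
    question = (
        f"What is the distance between the {a.label} and the {b.label}, "
        "in meters?"
    )
    return {"question": question, "answer": f"{distance_m:.1f}"}


if __name__ == "__main__":
    sofa = SceneObject("sofa", (0.0, 0.0, 0.0))
    table = SceneObject("table", (3.0, 4.0, 0.0))
    print(absolute_distance_qa(sofa, table))
```

The point of the sketch is that the answer is computed from scene annotations, so supervision of this kind can be generated at scale without manual labeling of each question.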

Code Implementations: 1 repo
