CV AI CL LG RO May 18, 2025

Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

arXiv:2505.12363v420.411 citations h-index: 2 Has Code

Originality: Incremental advance

AI Analysis

The paper addresses fine-grained spatial reasoning in AI systems. It is an incremental improvement, achieved with a compact 7B model.

The paper tackles the challenge of visuospatial cognition in multimodal large language models by introducing ViCA2, which achieves a state-of-the-art average score of 56.8 on the VSI-Bench benchmark, outperforming larger models like LLaVA-NeXT-Video-72B (40.9) and Gemini-1.5 Pro (45.4).

While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
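The abstract does not say how the two encoder streams are combined or how the token ratio is controlled. As a rough illustration only, the sketch below shows one plausible reading: tokens from a semantic encoder and a spatial encoder are assumed to be already projected to the same width, the spatial stream is average-pooled by a chosen stride to limit its share of the sequence, and the two streams are then concatenated. The function names, the pooling choice, and the `spatial_stride` parameter are all hypothetical and are not taken from the ViCA2 codebase.

```python
from typing import List

Vector = List[float]


def average_pool(tokens: List[Vector], stride: int) -> List[Vector]:
    """Merge each run of `stride` consecutive tokens into their mean."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(t[i] for t in group) / len(group) for i in range(dim)])
    return pooled


def fuse_tokens(semantic: List[Vector],
                spatial: List[Vector],
                spatial_stride: int) -> List[Vector]:
    """Concatenate semantic tokens with pooled spatial tokens.

    Hypothetical stand-in for a token ratio control step: a larger
    `spatial_stride` yields fewer spatial tokens in the fused sequence.
    Assumes both streams already share the same embedding width.
    """
    if semantic and spatial and len(semantic[0]) != len(spatial[0]):
        raise ValueError("both streams must have the same embedding width")
    return semantic + average_pool(spatial, spatial_stride)


# Toy usage: 4 semantic tokens and 8 spatial tokens, width 2.
semantic = [[1.0, 0.0] for _ in range(4)]
spatial = [[float(i), 2.0] for i in range(8)]
fused = fuse_tokens(semantic, spatial, spatial_stride=4)
print(len(fused))  # 4 semantic + 2 pooled spatial tokens
```

In this toy setup the stride is the only knob: it trades spatial detail against sequence length, which is the kind of efficiency trade-off the abstract attributes to token ratio control.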
