CV · Jun 20, 2025

Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

Chao Chen, Nobel Dang, Juexiao Zhang, Wenkai Sun, Pengfei Zheng, Xuhang He, Yimeng Ye, Jiasheng Zhang, Taarun Srinivas, Chen Feng

arXiv:2506.16805v3 · 2 citations · h-index: 9 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of developing vision models with human-like spatial reasoning for applications in 3D vision and robotic perception, though its contribution is incremental: a new benchmark and a baseline.

The paper tackles co-visibility reasoning over sparse image sets of indoor scenes. It introduces the Co-VisiON benchmark, with more than 1,000 scenarios, and shows that all tested models, including a proprietary vision-language model, fall significantly short of human performance. The proposed Covis baseline achieves the top results among vision-only models.

Humans exhibit a remarkable ability to recognize co-visibility, the 3D regions simultaneously visible in multiple images, even when these images are sparsely distributed across a complex scene. This ability is foundational to 3D vision and robotic perception, and it relies not only on low-level feature matching but also on high-level spatial reasoning and cognitive integration. Yet it remains unclear whether current vision models can replicate this human-level proficiency. In this work, we introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across more than 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. Notably, a proprietary vision-language model surpasses all vision-only baselines, but all models fall significantly short of human performance. This gap underscores the limitations of current architectures and motivates the need for models that integrate spatial and semantic information in a human-like manner. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advances in vision models capable of robust, cognitively inspired reasoning in challenging, sparse environments. Our dataset and source code can be found at https://ai4ce.github.io/CoVISION.
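
To make the definition of co-visibility concrete, the following is a minimal sketch and not the benchmark's evaluation code or the Covis model. It assumes each image comes with a known set of visible 3D region IDs, which is exactly the information a model working from sparse images alone does not have. The function name `covisibility_edges`, the `min_shared` threshold, and the toy data are all hypothetical.

```python
from itertools import combinations


def covisibility_edges(visible, min_shared=1):
    """Return (image_a, image_b, shared_count) for every image pair that
    shares at least `min_shared` visible 3D region IDs.

    `visible` maps an image name to the set of region IDs seen in it.
    This ground-truth mapping is an assumption made for illustration.
    """
    edges = []
    for a, b in combinations(sorted(visible), 2):
        shared = visible[a] & visible[b]  # regions seen in both images
        if len(shared) >= min_shared:
            edges.append((a, b, len(shared)))
    return edges


# Hypothetical toy scene: three images and the region IDs each one sees.
visible = {
    "img0": {1, 2, 3},
    "img1": {3, 4},
    "img2": {5, 6},
}

print(covisibility_edges(visible))  # [('img0', 'img1', 1)]
```

Here only `img0` and `img1` are co-visible, because region 3 appears in both. The difficulty the abstract describes comes from having to infer such overlaps from the images themselves, without access to the region sets.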
