CV · Apr 14

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

arXiv:2604.12551 · 74.3 · h-index: 4
Predicted impact: top 37% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For researchers in 3D scene understanding, this work provides a novel method to fuse multiview vision-language descriptors, improving 3D open-vocabulary segmentation.

The paper tackles the problem of lifting vision-language models from 2D to 3D for open-vocabulary semantic segmentation. It introduces a cross-attentive multiview fusion architecture that outperforms naive averaging and achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations.
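To make the baseline concrete, here is a minimal sketch of the "naive averaging" fusion that the paper compares against, followed by open-vocabulary labelling via cosine similarity to text embeddings. The function names (`average_fusion`, `classify_open_vocab`) and the 3-dimensional toy vectors are assumptions for illustration only; real vision-language descriptors have hundreds of dimensions and come from a pretrained encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def average_fusion(view_descriptors):
    """Baseline fusion: mean of unit-normalized per-view descriptors, renormalized."""
    normed = [l2_normalize(v) for v in view_descriptors]
    dim = len(normed[0])
    mean = [sum(v[i] for v in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify_open_vocab(instance_embedding, text_embeddings):
    """Pick the label whose text embedding is most cosine-similar to the instance."""
    q = l2_normalize(instance_embedding)
    scores = {
        label: sum(a * b for a, b in zip(q, l2_normalize(t)))
        for label, t in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): three views of one 3D instance; the last view is an
# outlier, e.g. an occluded or badly cropped observation.
views = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 0.0]]
text_embeddings = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}

fused = average_fusion(views)
label, scores = classify_open_vocab(fused, text_embeddings)
print(label, scores)
```

Because every view contributes equally, the outlier view pulls the fused embedding toward the wrong class. This is the weakness that a learned, attention-based weighting aims to reduce.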

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
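The abstract does not give architectural details, so the following is only a rough sketch of the two ideas it names: attention-weighted fusion of per-view descriptors, and a multiview consistency signal. It assumes a single query vector, single-head scaled dot-product attention, and identity projections in place of learned weight matrices. The consistency term here (one minus the cosine similarity between embeddings fused from two view subsets) is one plausible reading, not necessarily the loss used in CAMFusion. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, views):
    """Fuse N view descriptors into one embedding using attention weights.

    Assumption: keys and values are the unit-normalized descriptors themselves;
    a trained model would apply learned query/key/value projections instead.
    """
    dim = len(query)
    keys = [l2_normalize(v) for v in views]
    weights = softmax([dot(query, k) / math.sqrt(dim) for k in keys])
    fused = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
    return l2_normalize(fused), weights


def consistency_loss(query, views_a, views_b):
    """Illustrative self-supervision: embeddings fused from two view subsets
    of the same instance should agree (loss is 1 - cosine similarity)."""
    fused_a, _ = attention_fuse(query, views_a)
    fused_b, _ = attention_fuse(query, views_b)
    return 1.0 - dot(fused_a, fused_b)


# Toy data (hypothetical): the third view is an outlier.
views = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [4.0, 0.0, 0.0]  # stands in for a learned query vector

fused, weights = attention_fuse(query, views)
print("attention weights:", weights)
print("consistency loss:", consistency_loss(query, views[:2], views[1:]))
```

Unlike uniform averaging, the attention weights let views that agree with the query dominate the fused embedding, so the outlier view receives a small weight. In a trained model the query and projections would be optimized jointly with the supervised target-class loss that the abstract mentions.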

