CV, Jul 29, 2025

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

arXiv:2507.22052v1, 5 citations, h-index: 43
Originality: Incremental advance
AI Analysis

This work advances Spatial AI by moving toward real-time, semantics-aware 3D reconstruction, though the contribution appears incremental because it builds on existing CLIP and reconstruction methods.

The paper tackles open-vocabulary semantic 3D reconstruction from RGB videos by introducing Ov3R, a framework that integrates CLIP semantics directly into reconstruction and reports state-of-the-art performance in dense 3D reconstruction and open-vocabulary 3D segmentation.

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
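
To make the idea of open-vocabulary 3D segmentation concrete, here is a minimal sketch of the general technique: each 3D point carries a feature descriptor, and it receives the label whose text embedding is most similar under cosine similarity. This is not Ov3R's implementation. The function names (`normalize`, `cosine`, `label_points`) and the toy 3-dimensional vectors are assumptions made for illustration; in practice the text embeddings would come from a CLIP text encoder and the point descriptors from features lifted from 2D into 3D.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def label_points(point_features, text_embeddings):
    """Assign each point the label whose text embedding is most similar."""
    labels = []
    for feat in point_features:
        best = max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))
        labels.append(best)
    return labels


# Hypothetical toy embeddings standing in for CLIP text features.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

# Hypothetical per-point descriptors standing in for features lifted into 3D.
point_features = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]

labels = label_points(point_features, text_embeddings)
print(labels)  # ['chair', 'table', 'lamp']
```

Because the label set is just a dictionary of text embeddings, new categories can be queried without retraining, which is what makes the vocabulary "open" in this kind of pipeline.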
