CV · Jul 8, 2025

Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

arXiv:2507.06230v2 · 7 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of reducing annotation costs for 3D scene understanding in computer vision. The advance is incremental, since it adapts existing self-supervised techniques to a new task.

The paper tackles unsupervised semantic scene completion from single images by proposing SceneDINO, which is trained with multi-view consistency self-supervision and no semantic or geometric ground truth. It reports state-of-the-art unsupervised segmentation accuracy in both 3D and 2D, and linear probing of its 3D features matches the segmentation accuracy of a current supervised SSC approach.

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
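
The abstract evaluates the learned 3D features by linear probing, that is, by fitting only a linear classifier on top of frozen features. The following is a minimal, generic sketch of that idea, not the authors' evaluation protocol: the features are synthetic stand-ins for frozen per-voxel features, and all names (`train_linear_probe`, `predict`, the cluster centers) are hypothetical and chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(x, weights, bias):
    """Linear class scores w_k . x + b_k for one feature vector."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, bias)]


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=100):
    """Fit a softmax linear classifier with full-batch gradient descent.

    The features are treated as fixed inputs; only weights and bias are learned.
    """
    n, dim = len(features), len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(x, weights, bias))
            for k in range(num_classes):
                err = (probs[k] - (1.0 if k == y else 0.0)) / n
                grad_b[k] += err
                for j in range(dim):
                    grad_w[k][j] += err * x[j]
        for k in range(num_classes):
            bias[k] -= lr * grad_b[k]
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j]
    return weights, bias


def predict(x, weights, bias):
    """Return the index of the highest-scoring class."""
    s = scores(x, weights, bias)
    return max(range(len(s)), key=s.__getitem__)


# Synthetic stand-in for frozen features: three well-separated clusters (assumption).
random.seed(0)
centers = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
features, labels = [], []
for cls, center in enumerate(centers):
    for _ in range(30):
        features.append([c + random.gauss(0.0, 0.5) for c in center])
        labels.append(cls)

weights, bias = train_linear_probe(features, labels, num_classes=3)
correct = sum(predict(x, weights, bias) == y for x, y in zip(features, labels))
accuracy = correct / len(labels)
print(f"linear probe accuracy on synthetic features: {accuracy:.3f}")
```

Because the probe is only a linear map, its accuracy reflects how linearly separable the frozen features already are, which is why the abstract reports it as a measure of feature quality.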

Code Implementations: 1 repo
