CV · Jul 8, 2025

Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers

arXiv:2507.06230v2 · 14.47 citations · h-index: 7 · Has Code

Originality: Incremental advance

AI Analysis

This work targets the high annotation cost of 3D scene understanding in computer vision. The advance is incremental: it adapts existing self-supervised techniques to a new task.

The paper tackles unsupervised semantic scene completion from single images by proposing SceneDINO, which uses multi-view consistency self-supervision without ground-truth annotations, achieving state-of-the-art segmentation accuracy and matching supervised methods in linear probing.

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
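The linear-probing evaluation mentioned in the abstract can be illustrated with a minimal, generic sketch: features are kept frozen and only a linear classifier is trained on top of them. This is not the authors' code. The toy features below are synthetic stand-ins for per-point 3D features, and every name (`train_linear_probe`, `predict`, `make_toy_features`) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores_for(W, b, x):
    """Linear class scores W x + b for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear softmax classifier with full-batch gradient descent.

    The features are treated as fixed inputs: only W and b are updated,
    which is what makes this a probe rather than fine-tuning.
    """
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores_for(W, b, x))
            for c in range(num_classes):
                # Cross-entropy gradient w.r.t. the logit of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for k in range(dim):
                    grad_W[c][k] += err * x[k]
        for c in range(num_classes):
            b[c] -= lr * grad_b[c] / n
            for k in range(dim):
                W[c][k] -= lr * grad_W[c][k] / n
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    s = scores_for(W, b, x)
    return max(range(len(s)), key=s.__getitem__)


def make_toy_features(n_per_class, centers, noise, rng):
    """Synthetic stand-in for frozen features: Gaussian blobs around class centers."""
    features, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            features.append([c + rng.gauss(0.0, noise) for c in center])
            labels.append(label)
    return features, labels


rng = random.Random(0)
centers = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
train_x, train_y = make_toy_features(30, centers, 0.3, rng)
test_x, test_y = make_toy_features(30, centers, 0.3, rng)

W, b = train_linear_probe(train_x, train_y, num_classes=3)
correct = sum(predict(W, b, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
```

In a real evaluation the toy blobs would be replaced by features extracted from the frozen model, and accuracy would be reported with the benchmark's own segmentation metric rather than plain classification accuracy.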
