CVJun 26, 2025

PanSt3R: Multi-view Consistent Panoptic Segmentation

Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, Gabriela Csurka

arXiv:2506.21348v119.716 citationsh-index: 28

Originality Incremental advance

AI Analysis

This addresses the challenge of efficient and accurate 3D scene understanding for applications like robotics and AR/VR, though it builds incrementally on existing 3D reconstruction methods.

The paper tackles the problem of panoptic segmentation in 3D scenes from unposed 2D images by proposing PanSt3R, a unified approach that jointly predicts 3D geometry and multi-view panoptic segmentation in a single forward pass, eliminating test-time optimization and achieving state-of-the-art performance while being orders of magnitude faster.

Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.

View on arXiv PDF

Similar