CV · Oct 31, 2023

Joint Depth Prediction and Semantic Segmentation with Multi-View SAM

Mykhailo Shvets, Dongxu Zhao, Marc Niethammer, Roni Sengupta, Alexander C. Berg

arXiv:2311.00134v18.413 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the limitation of single-view predictions in robotics applications by providing a multi-view approach that improves both depth and segmentation tasks.

The paper tackles joint depth prediction and semantic segmentation by combining multiple views with features from the Segment Anything Model (SAM). On the ScanNet dataset, it consistently outperforms single-task MVS and segmentation models as well as multi-task monocular methods.

Multi-task approaches to joint depth and segmentation prediction are well studied for monocular images. Yet predictions from a single view are inherently limited, while multiple views are available in many robotics applications. At the other end of the spectrum, video-based and full 3D methods require numerous frames to perform reconstruction and segmentation. In this work, we propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM). This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder. We report the mutual benefit that both tasks enjoy in our quantitative and qualitative studies on the ScanNet dataset. Our approach consistently outperforms single-task MVS and segmentation models, along with multi-task monocular methods.
