Joint Depth Prediction and Semantic Segmentation with Multi-View SAM
This work addresses the limitations of single-view prediction in robotics applications with a multi-view approach that improves both depth prediction and semantic segmentation.
The paper tackles joint depth prediction and semantic segmentation by leveraging multiple views and the Segment Anything Model (SAM). On the ScanNet dataset, it consistently outperforms single-task MVS and segmentation models as well as multi-task monocular methods.
Multi-task approaches to joint depth and segmentation prediction are well studied for monocular images. Yet predictions from a single view are inherently limited, even though multiple views are available in many robotics applications. At the other end of the spectrum, video-based and full 3D methods require numerous frames to perform reconstruction and segmentation. In this work, we propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM). The enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder. Quantitative and qualitative studies on the ScanNet dataset show that the two tasks benefit each other. Our approach consistently outperforms single-task MVS and segmentation models, as well as multi-task monocular methods.
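The abstract does not describe the internals of the MVS module. As a rough illustration of how per-view features can drive depth prediction, the sketch below follows a common MVS formulation rather than the authors' method: a reference-pixel feature is compared against source-view features warped at several depth hypotheses, and depth is regressed as the softmax-weighted average of those hypotheses. The function names, the toy two-dimensional features, and the depth values are all assumptions made for illustration; in the paper's setting the features would come from SAM, and the warping step is assumed to have already been done.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_scores(ref_feat, warped_src_feats):
    """Dot-product similarity between one reference-pixel feature and the
    source-view feature sampled at each depth hypothesis (hypothetical helper;
    the warping itself is assumed to have happened upstream)."""
    return [sum(r * s for r, s in zip(ref_feat, src)) for src in warped_src_feats]


def regress_depth(similarities, depth_hypotheses):
    """Expected depth under the softmax of the similarity scores."""
    probs = softmax(similarities)
    return sum(p * d for p, d in zip(probs, depth_hypotheses))


# Toy example: three depth hypotheses (in metres) for a single pixel.
depth_hypotheses = [1.0, 2.0, 3.0]
ref_feat = [1.0, 0.0]
# Source features after warping at each hypothesis; only the middle one matches.
warped_src_feats = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

scores = match_scores(ref_feat, warped_src_feats)
depth = regress_depth(scores, depth_hypotheses)
# The scores are symmetric around the middle hypothesis, so depth is about 2.0.
print(depth)
```

In the pipeline the abstract describes, a depth map obtained this way would then be passed as a prompt to the segmentation decoder; that step is not sketched here because the abstract gives no detail on how the prompt is encoded.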