CV · Mar 6

JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach

arXiv:2603.06168v1 · 11.0 · h-index: 16

Predicted impact: top 39% in CV · last 90 days
Originality: Highly original

AI Analysis

This work addresses the problem of limited annotated data and fixed-label models in semantic segmentation across visual modalities, which is a challenge for researchers and practitioners working with multi-modal scene understanding.

This paper introduces JOPP-3D, an open-vocabulary semantic segmentation framework that jointly uses panoramic images and 3D point clouds to enable language-driven scene understanding. It achieves significant improvement over state-of-the-art methods in both open and closed vocabulary 2D and 3D semantic segmentation on the Stanford-2D-3D-s and ToF-360 datasets.

Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
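The two steps the abstract names, lifting an RGB-D panorama into 3D and querying aligned features with natural language, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the projection convention or the feature extractor, so the equirectangular layout, the axis convention, the function names, and the toy two-dimensional embeddings below are all assumptions chosen for illustration. In practice the embeddings would come from a vision-language model rather than being written by hand.

```python
import math


def panorama_pixel_to_point(u, v, depth, width, height):
    """Back-project one equirectangular pixel to a 3D point.

    Assumed convention (not taken from the paper): longitude runs from -pi to
    pi left to right, latitude from pi/2 to -pi/2 top to bottom, `depth` is
    the distance along the viewing ray, and the frame is x forward, y left,
    z up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = depth * math.cos(lat) * math.cos(lon)
    y = -depth * math.cos(lat) * math.sin(lon)
    z = depth * math.sin(lat)
    return (x, y, z)


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def label_points(point_features, text_features):
    """Give each point the prompt whose embedding is most similar to it.

    `point_features` is a list of per-point feature vectors and
    `text_features` maps a prompt string to its embedding; both are
    hypothetical stand-ins for features from a vision-language model.
    """
    labels = []
    for feat in point_features:
        scores = {name: cosine(feat, emb) for name, emb in text_features.items()}
        labels.append(max(scores, key=scores.get))
    return labels


if __name__ == "__main__":
    # A pixel at the horizontal and vertical centre looks straight ahead.
    print(panorama_pixel_to_point(1.5, 0.5, 2.0, width=4, height=2))

    # Toy embeddings; real ones would be high-dimensional.
    prompts = {"chair": [1.0, 0.1], "wall": [0.0, 1.0]}
    features = [[1.0, 0.0], [0.1, 0.9]]
    print(label_points(features, prompts))
```

Because labels are assigned by comparing against whatever prompts are supplied, the label set can be changed at query time without retraining, which is what distinguishes an open-vocabulary setup from a fixed-label model.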
