CVMar 12, 2023

Towards Universal Vision-language Omni-supervised Segmentation

arXiv:2303.06547v11 citationsh-index: 103
Originality Incremental advance
AI Analysis

This work addresses the challenge of open-world segmentation for computer vision applications, offering an incremental improvement over existing methods.

The paper tackles the problem of open-world universal segmentation by proposing VLOSS, an end-to-end method that integrates omni-supervised data to improve segmentation accuracy, achieving a ~2% mask AP gain over MaskCLIP on the LVIS v1 dataset with fewer parameters.

Existing open-world universal segmentation approaches usually leverage CLIP and pre-computed proposal masks to treat open-world segmentation tasks as proposal classification. However, 1) these works cannot handle universal segmentation in an end-to-end manner, and 2) the limited scale of panoptic datasets restricts the open-world segmentation ability on things classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with CLIP text encoder. To improve the open-world segmentation ability, we leverage omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs data) into training, thus enriching the open-world segmentation ability and achieving better segmentation accuracy. To better improve the training efficiency and fully release the power of omni-supervised data, we propose several advanced techniques, i.e., FPN-style encoder, switchable training technique, and positive classification loss. Benefiting from the end-to-end training manner with proposed techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in terms of mask AP on LVIS v1 dataset.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes