CV, AI | Apr 4, 2023

Towards Open-Vocabulary Video Instance Segmentation

Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, XU Tang, Yao Hu, Weidi Xie, Efstratios Gavves

arXiv:2304.01715v2 | 21.556 citations | h-index: 50 | Has Code

Originality: Highly original

AI Analysis

This paper addresses the problem of handling novel object categories in real-world videos for computer vision applications. It contributes a new task and a new dataset rather than an incremental improvement.

The paper tackles the closed-set limitation of Video Instance Segmentation (VIS) by introducing Open-Vocabulary VIS, which segments, tracks, and classifies objects from open-set categories, including novel ones unseen during training. Its model, OV2Seg, shows strong zero-shot generalization on LV-VIS, a new large-scale dataset with 1,196 categories, and on four existing VIS datasets.

Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, and therefore lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196 diverse categories, surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.
