CV Mar 21, 2023

Two-shot Video Object Segmentation

Kun Yan, Xiao Li, Fangyun Wei, Jinglu Wang, Chenbin Zhang, Ping Wang, Yan Lu

arXiv:2303.12078v19.127 citations h-index: 32 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the annotation cost of video object segmentation for researchers and practitioners, offering an incremental improvement in label efficiency during training.

The paper tackles the cost of pixel-level annotation in video object segmentation by proposing a two-shot training paradigm that requires only two labeled frames per training video. It reports results comparable to fully supervised methods while using just 7.3% and 2.9% of the labeled data on the YouTube-VOS and DAVIS benchmarks, respectively.

Previous works on video object segmentation (VOS) are trained on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we require only two labeled frames per training video, and performance is sustained. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restriction on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS benchmarks, respectively, our approach achieves results comparable to those of counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
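The data flow of the three phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`sample_clip`, `build_pseudo_label_bank`, `merge_labels`, `predict`) is hypothetical, masks are represented as plain strings, and no model is actually trained. The `predict` callable stands in for the phase-1 pre-trained VOS model, and the clip sampler assumes that only the first frame of a phase-1 clip has to be a ground-truth frame.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Mask = str                    # stand-in for a segmentation mask (assumption)
Video = List[Optional[Mask]]  # one entry per frame; None means unlabeled


def sample_clip(labels: Video, clip_len: int, first_must_be_labeled: bool,
                rng: random.Random) -> List[int]:
    """Pick frame indices for one training clip."""
    n = len(labels)
    if first_must_be_labeled:
        # Phase 1: the first (reference) frame is always a ground-truth frame.
        labeled = [i for i, m in enumerate(labels) if m is not None]
        first = rng.choice(labeled)
        rest = rng.sample([i for i in range(n) if i != first], clip_len - 1)
        return [first] + sorted(rest)
    # Phase 3: any frame may come first, since every frame now has a label.
    return sorted(rng.sample(range(n), clip_len))


def build_pseudo_label_bank(
    videos: Dict[str, Video],
    predict: Callable[[str, int], Mask],
) -> Dict[Tuple[str, int], Mask]:
    """Phase 2: store a predicted mask for every unlabeled frame."""
    bank: Dict[Tuple[str, int], Mask] = {}
    for vid, labels in videos.items():
        for t, mask in enumerate(labels):
            if mask is None:
                bank[(vid, t)] = predict(vid, t)
    return bank


def merge_labels(videos: Dict[str, Video],
                 bank: Dict[Tuple[str, int], Mask]) -> Dict[str, Video]:
    """Combine ground-truth labels with pseudo labels for phase-3 retraining."""
    return {
        vid: [m if m is not None else bank[(vid, t)] for t, m in enumerate(labels)]
        for vid, labels in videos.items()
    }


# Toy two-shot video: five frames, of which exactly two carry ground truth.
videos = {"clip_a": ["gt0", None, None, "gt3", None]}
bank = build_pseudo_label_bank(videos, predict=lambda vid, t: f"pseudo:{vid}:{t}")
dense = merge_labels(videos, bank)
```

In this toy setup, `sample_clip(videos["clip_a"], 3, True, rng)` mimics the phase-1 restriction, while calling it on `dense["clip_a"]` with `first_must_be_labeled=False` mimics the unrestricted phase-3 sampling.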
