CV, Mar 23, 2021

Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency

Qing Liu, Vignesh Ramanathan, Dhruv Mahajan, Alan Yuille, Zhenheng Yang

arXiv:2103.12886v112.126 citations, h-index: 130

Originality: Incremental advance

AI Analysis

This work aims to reduce annotation cost for video instance segmentation. It is an incremental advance: it builds on existing weakly supervised methods by adding video-specific signals.

The paper tackles partial segmentation and missing object predictions in weakly supervised instance segmentation by training on weakly labeled videos instead of images. Using motion and temporal consistency across frames, it improves AP50 by 5% on Youtube-VIS and by 3% on Cityscapes.

Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches which rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric $AP_{50}$ on video frames of two datasets: Youtube-VIS and Cityscapes by $5\%$ and $3\%$ respectively.
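The idea of carrying predictions over to a neighboring frame can be illustrated with a minimal sketch. This is not the paper's MaskConsist module: the function names `mask_iou` and `transfer_missing_masks`, the IoU-based matching rule, and the `iou_thresh` value are all assumptions made for illustration, and the sketch skips both the stability check and any motion-based alignment between frames.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def transfer_missing_masks(masks_t, masks_next, iou_thresh=0.5):
    """Copy masks from frame t into frame t+1 when t+1 has no matching mask.

    Hypothetical rule: a mask from frame t counts as "missing" in frame t+1
    if its best IoU against every mask in frame t+1 is below iou_thresh.
    Masks are copied unchanged (no warping), which is only reasonable when
    objects move little between the two frames.
    """
    result = list(masks_next)
    for m in masks_t:
        best = max((mask_iou(m, n) for n in masks_next), default=0.0)
        if best < iou_thresh:
            result.append(m.copy())
    return result


# Toy example: frame t has two objects, frame t+1 detected only the first.
obj_a = np.zeros((4, 4), dtype=bool)
obj_a[0:2, 0:2] = True
obj_b = np.zeros((4, 4), dtype=bool)
obj_b[2:4, 2:4] = True

completed = transfer_missing_masks([obj_a, obj_b], [obj_a.copy()])
print(len(completed))
```

In this toy case the second object is absent from frame t+1, so its mask is copied over as an extra pseudo-label. The 0.5 threshold mirrors the IoU cutoff behind the $AP_{50}$ metric, but that choice is a convenience here, not something the abstract specifies.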
