CV · Mar 23, 2021

Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency

arXiv:2103.12886v1 · 26 citations
Originality: Incremental advance
AI Analysis

This work reduces annotation cost for instance segmentation by training on weakly labeled videos. It is an incremental advance: it builds on existing weakly supervised methods and adds video-specific signals.

The paper targets two failure modes of weakly supervised instance segmentation: partial segmentation of objects and missing object predictions. It trains on videos instead of images and uses motion and temporal consistency across frames as extra supervision. The reported gains in AP50 on video frames are 5% on Youtube-VIS and 3% on Cityscapes.

Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches which rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric $AP_{50}$ on video frames of two datasets: Youtube-VIS and Cityscapes by $5\%$ and $3\%$ respectively.
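To make the idea of transferring stable predictions between neighboring frames more concrete, here is a minimal sketch. It is not the authors' MaskConsist implementation. The function names, the score-based notion of "stable", and both thresholds are assumptions made for illustration. It also assumes the two frames are already aligned, so no motion-based warping is applied.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def transfer_missing_masks(src, dst, score_thresh=0.8, match_iou=0.5):
    """Copy confident masks from a source frame into a neighboring frame.

    src, dst: lists of (mask, score) pairs, one list per frame.
    A source prediction is treated as "stable" (hypothetical criterion)
    when its score is at least score_thresh. It is transferred only if
    no prediction in dst overlaps it with IoU >= match_iou, i.e. the
    instance appears to be missing in the neighboring frame.
    """
    out = list(dst)
    for mask, score in src:
        if score < score_thresh:
            continue  # not confident enough to act as a pseudo-label
        if all(mask_iou(mask, m) < match_iou for m, _ in dst):
            out.append((mask, score))
    return out
```

In a training loop, the returned list could serve as an augmented set of pseudo-labels for the neighboring frame. A more faithful version would need to account for object motion between frames before comparing masks.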

