CV · Dec 15, 2022

Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration

Liqi Yan, Qifan Wang, Siqi Ma, Jingang Wang, Changbin Yu

arXiv:2212.07592v110.673 citations · h-index: 24

Originality: Incremental advance

AI Analysis

This work aims to reduce annotation costs for video instance segmentation, a problem that matters to computer vision researchers and practitioners. The advance is incremental, since it builds on existing weakly supervised methods.

The paper proposes STC-Seg, a weakly supervised framework for instance segmentation in videos. It combines three components: pseudo-labels derived from depth and optical flow, a puzzle loss for training with box-level annotations, and a tracking module with spatio-temporal collaboration. On the KITTI MOTS and YT-VIS datasets it achieves strong performance, outperforming the fully supervised TrackR-CNN and MaskTrack R-CNN.

Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg makes four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reveals only the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
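The abstract does not spell out how the tracking module uses bounding-box diagonal points, so the following is only an illustrative sketch of the general idea, not STC-Seg's tracker. It assumes boxes are given as (x1, y1, x2, y2), measures how far the top-left and bottom-right corners move between two consecutive frames, and associates boxes greedily. The function names, the greedy matching, and the `max_distance` threshold are all hypothetical, and the spatio-temporal discrepancy term mentioned in the abstract is omitted.

```python
from math import hypot


def diagonal_distance(box_a, box_b):
    """Total displacement of the two diagonal corners between two boxes.

    Boxes are (x1, y1, x2, y2): top-left and bottom-right corners.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return hypot(ax1 - bx1, ay1 - by1) + hypot(ax2 - bx2, ay2 - by2)


def match_boxes(prev_boxes, curr_boxes, max_distance=50.0):
    """Greedy one-to-one association between consecutive frames.

    Returns a dict mapping each matched index in curr_boxes to an index
    in prev_boxes. max_distance is a hypothetical gating threshold in pixels.
    """
    # Enumerate all candidate pairs, cheapest corner displacement first.
    pairs = sorted(
        (diagonal_distance(p, c), pi, ci)
        for pi, p in enumerate(prev_boxes)
        for ci, c in enumerate(curr_boxes)
    )
    used_prev, matches = set(), {}
    for dist, pi, ci in pairs:
        if dist > max_distance:
            break  # remaining pairs are even farther apart
        if pi in used_prev or ci in matches:
            continue  # each box takes part in at most one match
        used_prev.add(pi)
        matches[ci] = pi
    return matches
```

Boxes in the current frame that receive no match would start new tracks in a full pipeline. A real tracker would typically replace the greedy step with an optimal assignment and add appearance or motion cues.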
