CV · Mar 17, 2023

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

arXiv:2303.10100v1 · 28 citations · h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses the problem of reducing annotation costs for video object segmentation, making it more accessible, though it is incremental as it builds on prior self-supervised efforts.

The paper tackles self-supervised video object segmentation by developing a unified framework that learns cross-frame correspondence and object-level mask embedding from unlabeled videos, achieving state-of-the-art results on DAVIS17 and YouTube-VOS benchmarks and narrowing the gap with fully supervised methods.

The objective of this paper is self-supervised learning of video object segmentation (VOS). We develop a unified framework that simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it can learn mask-guided sequential segmentation directly from unlabeled videos, in contrast to previous efforts, which usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) using the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme, so as to keep the learnt representation generic and avoid cluster degeneracy. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design.
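
The two ideas contrasted in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: plain k-means stands in for the unspecified pixel-clustering step, a softmax-weighted affinity vote stands in for the label "copying" used by prior correspondence-based methods, and the function names, the `temperature` value, and the flat `(N, D)` feature layout are all assumptions made for illustration.

```python
import numpy as np


def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Cluster per-pixel features of shape (N, D) into k groups.

    Stand-in (assumption) for the 'clustering video pixels' step;
    returns an (N,) array of integer pseudo labels.
    """
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct pixels (fancy indexing copies).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center: (N, k)
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels


def copy_labels(ref_feats, ref_labels, qry_feats, num_classes, temperature=0.07):
    """Label 'copying': each query pixel takes a softmax-weighted vote
    over reference-pixel labels, weighted by cosine similarity.

    Illustrates the correspondence-only baseline, not the paper's decoder.
    """
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    qry = qry_feats / np.linalg.norm(qry_feats, axis=1, keepdims=True)
    logits = qry @ ref.T / temperature            # (Nq, Nr)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)
    one_hot = np.eye(num_classes)[ref_labels]     # (Nr, C)
    return (affinity @ one_hot).argmax(axis=1)


# Toy data: two well-separated groups of 2-D "pixel features".
rng = np.random.default_rng(1)
blob_a = np.array([5.0, 0.0]) + 0.1 * rng.standard_normal((50, 2))
blob_b = np.array([0.0, 5.0]) + 0.1 * rng.standard_normal((50, 2))
frame0 = np.concatenate([blob_a, blob_b])
frame1 = frame0 + 0.05 * rng.standard_normal(frame0.shape)  # "next frame"

pseudo = kmeans_pseudo_labels(frame0, k=2)
propagated = copy_labels(frame0, pseudo, frame1, num_classes=2)
```

In this sketch the pseudo labels from `kmeans_pseudo_labels` would serve as training targets for a mask encoder and decoder, whereas `copy_labels` only transfers existing labels between frames and learns nothing about object-level masks.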

Code Implementations: 1 repo
