CV · Nov 29, 2021

MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation

arXiv:2111.14646v1 · 28 citations
Originality: Incremental advance
AI Analysis

This work aims to improve video segmentation accuracy for computer vision applications and reports a significant performance boost under low-data conditions. It is an incremental advance, since it builds on existing dense matching-based methods.

The paper tackles semi-supervised video object segmentation by incorporating motion information, which previous dense matching-based methods ignored. Trained only on DAVIS17, it achieves a 76.5% J&F score, significantly outperforming state-of-the-art methods under the low-data protocol.

The task of semi-supervised video object segmentation (VOS) has been greatly advanced, with dense matching-based methods achieving state-of-the-art performance. Recent methods leverage space-time memory (STM) networks and learn to retrieve relevant information from all available sources: the past frames with object masks form an external memory, and the current frame, acting as the query, is segmented using the mask information in the memory. However, when forming the memory and performing matching, these methods exploit only appearance information and ignore motion information. In this paper, we advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised VOS. First, we propose an implicit method to learn the spatial correspondences between neighboring frames, building upon a correlation cost volume. To handle the challenging cases of occlusion and textureless regions when constructing dense correspondences, we incorporate the uncertainty in dense matching and achieve a motion uncertainty-aware feature representation. Second, we introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature. Comprehensive experiments on challenging benchmarks show that using a small amount of data and combining it with powerful motion information can bring a significant performance boost. We achieve 76.5% J&F using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol. The code will be released.
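To make the idea of a correlation cost volume with matching uncertainty concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `local_cost_volume` and `matching_uncertainty`, the search radius, the scaling by the square root of the channel count, and the use of normalized softmax entropy as the uncertainty measure are all assumptions chosen for illustration. The abstract does not say how MUNet computes its uncertainty.

```python
import numpy as np

def local_cost_volume(feat_prev, feat_cur, radius=1):
    """Correlate each current-frame feature with previous-frame features
    inside a (2*radius+1)^2 search window (hypothetical helper).

    Inputs have shape (C, H, W); the output has shape (D, H, W),
    where D = (2*radius+1)**2 is the number of candidate displacements.
    """
    c, h, w = feat_cur.shape
    # Zero-pad the previous frame so every displacement stays in bounds.
    padded = np.pad(feat_prev, ((0, 0), (radius, radius), (radius, radius)))
    planes = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, dy:dy + h, dx:dx + w]
            # Channel-wise dot product, scaled by sqrt(C) (an assumption).
            planes.append((feat_cur * shifted).sum(axis=0) / np.sqrt(c))
    return np.stack(planes, axis=0)

def matching_uncertainty(cost):
    """Normalized entropy of the softmax over displacements (assumed measure).

    Values near 1 mean the match is ambiguous; values near 0 mean one
    displacement clearly dominates.
    """
    z = cost - cost.max(axis=0, keepdims=True)  # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=0)
    return entropy / np.log(cost.shape[0])
```

Under these assumptions, a perfectly flat (textureless) feature map yields an uncertainty of about 1 at every pixel, because all displacements correlate equally well. This is the kind of ambiguous region the abstract says the uncertainty is meant to handle.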

