CV · Aug 8, 2022

Two-Stream Networks for Object Segmentation in Videos

arXiv:2208.04026v1 · 1 citation · h-index: 103
Originality: Incremental advance
AI Analysis

This work improves video object segmentation accuracy for computer vision applications, but it is incremental as it builds on existing matching-based approaches.

The paper tackles the problem of video object segmentation by addressing unseen pixels that lack correspondence in memory, proposing a Two-Stream Network that fuses pixel and instance streams with a routing map, achieving state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on DAVIS-2017.

Existing matching-based approaches perform video object segmentation (VOS) by retrieving support features from a pixel-level memory, but some pixels may lack correspondence in the memory (i.e., they are unseen), which inevitably limits segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, which segments the seen pixels based on their pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module that generates a routing map, with which the output embeddings of the two streams are fused together. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing the two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of our proposed TSN, and we also report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.
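The fusion step in (iii) can be illustrated with a small sketch. The abstract does not give the fusion formula, so the per-pixel convex blend below is an assumption, not the authors' method: a routing weight near 1 trusts the pixel stream (seen pixels) and a weight near 0 trusts the instance stream (unseen pixels). The function name `fuse_streams` and the toy embeddings are hypothetical.

```python
from typing import List, Sequence


def fuse_streams(
    pixel_emb: Sequence[Sequence[float]],
    instance_emb: Sequence[Sequence[float]],
    routing: Sequence[float],
) -> List[List[float]]:
    """Blend two per-pixel embeddings with a routing weight in [0, 1].

    ASSUMPTION: a convex combination r * pixel + (1 - r) * instance.
    The source only says the routing map is used to fuse the streams.
    """
    if not (len(pixel_emb) == len(instance_emb) == len(routing)):
        raise ValueError("all inputs must cover the same number of pixels")

    fused = []
    for p, q, r in zip(pixel_emb, instance_emb, routing):
        if not 0.0 <= r <= 1.0:
            raise ValueError("routing weights must lie in [0, 1]")
        if len(p) != len(q):
            raise ValueError("embeddings must share the same channel count")
        # Channel-wise blend for this pixel.
        fused.append([r * a + (1.0 - r) * b for a, b in zip(p, q)])
    return fused


# Toy example: three pixels, two channels each.
pixel_emb = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
instance_emb = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
routing = [1.0, 0.0, 0.5]  # seen, unseen, ambiguous

fused = fuse_streams(pixel_emb, instance_emb, routing)
print(fused)  # [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```

In this toy case the first pixel keeps the pixel-stream embedding, the second takes the instance-stream embedding, and the third receives an even mix. A real model would predict the routing map from features and operate on dense tensors.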

