CV · Jun 2, 2021

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

arXiv:2106.01061v2 · 60 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of segmenting objects in video from natural-language references, a computer vision task, and represents an incremental improvement over previous methods.

The paper tackles referring video object segmentation by proposing a two-stage, top-down approach that first constructs object tracklets and then uses a Transformer-based module for grounding, achieving first place on the CVPR2021 Referring Youtube-VOS challenge.

Referring video object segmentation (RVOS) aims to segment video objects with the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranked first in the CVPR2021 Referring Youtube-VOS challenge.
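
The second stage described above (choosing the tracklet that matches the language reference) can be illustrated with a minimal sketch. This is not the paper's module: cosine similarity over fixed vectors stands in for the Transformer-based grounding, and the tracklet dictionary layout, the embeddings, and the `ground_tracklet` helper are all hypothetical names introduced for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_tracklet(tracklets, query_embedding):
    """Score every candidate tracklet against the query and pick the best.

    Assumption: each tracklet already carries one pooled embedding and a
    list of per-frame masks produced by stage one (detection + propagation).
    Cosine similarity is a stand-in for the Transformer grounding module.
    """
    scores = {
        tracklet_id: cosine(t["embedding"], query_embedding)
        for tracklet_id, t in tracklets.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores


# Toy stage-one output: two object tracklets spanning a 3-frame video.
# Mask entries are placeholders for real per-frame segmentation masks.
tracklets = {
    "person": {"embedding": [0.9, 0.1, 0.0], "masks": ["p_f0", "p_f1", "p_f2"]},
    "dog": {"embedding": [0.1, 0.9, 0.2], "masks": ["d_f0", "d_f1", "d_f2"]},
}

# Hypothetical embedding of a language reference such as "the dog on the left".
query_embedding = [0.0, 1.0, 0.1]

best_id, scores = ground_tracklet(tracklets, query_embedding)
selected_masks = tracklets[best_id]["masks"]  # per-frame masks of the chosen object
```

The point of the top-down design is visible even in this toy form: the language reference is compared against whole object tracklets rather than individual pixels, so the output is always the complete mask sequence of one object.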

