CV · Mar 5, 2025

Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

arXiv:2503.03492v2 · 3 citations · h-index: 16 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in video object segmentation for applications such as video editing and autonomous systems. The advance appears incremental: it builds on existing methods, and its novelty lies in the architecture.

The paper tackles the problem of ambiguous target identification and inconsistent mask propagation in referring video object segmentation by introducing FindTrack, a decoupled framework that separates target identification from mask propagation. It significantly outperforms existing methods on public benchmarks.

Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
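The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the weighted sum below, the `weight` parameter, and the function names `select_key_frame` and `propagate_from_key_frame` are all assumptions made for illustration. The per-frame scores and the `step` callback stand in for whatever segmentation and tracking models are actually used.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Mask = TypeVar("Mask")


def select_key_frame(
    seg_conf: Sequence[float],
    align: Sequence[float],
    weight: float = 0.5,
) -> int:
    """Pick the frame with the best combined score.

    ASSUMPTION: the balance between segmentation confidence and
    vision-text alignment is modelled as a simple weighted sum; the
    paper's actual criterion may differ.
    """
    if not seg_conf or len(seg_conf) != len(align):
        raise ValueError("score lists must be non-empty and equally long")
    scores = [weight * c + (1.0 - weight) * a for c, a in zip(seg_conf, align)]
    # Index of the highest combined score (first one wins on ties).
    return max(range(len(scores)), key=scores.__getitem__)


def propagate_from_key_frame(
    num_frames: int,
    key_idx: int,
    key_mask: Mask,
    step: Callable[[Mask, int], Mask],
) -> List[Optional[Mask]]:
    """Spread the key-frame mask to every other frame.

    `step(neighbour_mask, t)` is a hypothetical callback that returns the
    mask for frame t given the mask of the adjacent, already-processed frame.
    """
    masks: List[Optional[Mask]] = [None] * num_frames
    masks[key_idx] = key_mask
    # Forward pass: key frame to the end of the video.
    for t in range(key_idx + 1, num_frames):
        masks[t] = step(masks[t - 1], t)
    # Backward pass: key frame to the start of the video.
    for t in range(key_idx - 1, -1, -1):
        masks[t] = step(masks[t + 1], t)
    return masks


if __name__ == "__main__":
    # Toy scores for a three-frame clip; real values would come from models.
    key = select_key_frame([0.9, 0.6, 0.8], [0.2, 0.9, 0.9])
    print("key frame:", key)
```

The point of the split is that the language prompt is consulted only when choosing and segmenting the key frame; every other frame is handled by propagation alone, which is the decoupling the abstract describes.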
