Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
This work addresses a key bottleneck in referring video object segmentation, with relevance to applications such as video editing and autonomous systems. Its contribution is primarily architectural: it builds on existing methods rather than introducing new components, so it may be viewed as incremental.
The paper tackles ambiguous target identification and inconsistent mask propagation in referring video object segmentation by introducing FindTrack, a framework that decouples target identification from mask propagation. The authors report that FindTrack significantly outperforms existing methods on public benchmarks.
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
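The two-stage flow described above (pick a key frame, then propagate from it) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says the key frame is chosen by "balancing" segmentation confidence and vision-text alignment, so the weighted sum, the `alpha` parameter, and the names `select_key_frame`, `propagate`, and `step` are all assumptions made for this example. The per-frame scores and the mask-update function are assumed to come from external models not shown here.

```python
from typing import Callable, List, Sequence, TypeVar

Mask = TypeVar("Mask")


def select_key_frame(
    seg_conf: Sequence[float],
    align: Sequence[float],
    alpha: float = 0.5,
) -> int:
    """Return the index of the frame with the best combined score.

    ASSUMPTION: scores are combined as a weighted sum; the source only
    states that the two signals are "balanced".
    """
    if not seg_conf or len(seg_conf) != len(align):
        raise ValueError("score lists must be non-empty and equally long")
    scores = [alpha * c + (1.0 - alpha) * a for c, a in zip(seg_conf, align)]
    # max() returns the first index on ties.
    return max(range(len(scores)), key=scores.__getitem__)


def propagate(
    num_frames: int,
    key_idx: int,
    key_mask: Mask,
    step: Callable[[Mask, int], Mask],
) -> List[Mask]:
    """Spread the key-frame mask forward and backward through the video.

    `step(neighbor_mask, t)` is a hypothetical stand-in for a tracking
    model that predicts the mask at frame t from an adjacent frame's mask.
    """
    if not 0 <= key_idx < num_frames:
        raise ValueError("key_idx out of range")
    masks: List[Mask] = [key_mask] * num_frames
    # Forward pass: frames after the key frame.
    for t in range(key_idx + 1, num_frames):
        masks[t] = step(masks[t - 1], t)
    # Backward pass: frames before the key frame.
    for t in range(key_idx - 1, -1, -1):
        masks[t] = step(masks[t + 1], t)
    return masks
```

In this sketch the language prompt influences only `select_key_frame` (through the alignment scores), while `propagate` is purely visual. That separation is the decoupling the abstract describes: an identification error can be traced to a single frame choice rather than to per-frame multi-modal fusion.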