CV · Jun 8, 2024

Training-Free Robust Interactive Video Object Segmentation

arXiv:2406.05485v1 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses accurate object segmentation across diverse domains in video, which matters for applications such as video editing and data annotation. The advance is incremental because it builds on existing SAM capabilities.

The paper tackles the problem of interactive video object segmentation by proposing a training-free prompt tracking framework that leverages SAM, achieving robust zero-shot results on datasets like DAVIS 2017, YouTube-VOS 2018, and MOSE 2023 with a good tradeoff between performance and interaction time.

Interactive video object segmentation is a crucial video task with applications ranging from video editing to data annotation. However, current approaches struggle to segment objects accurately across diverse domains. Recently, the Segment Anything Model (SAM) introduced interactive visual prompts and demonstrated impressive performance across different domains. In this paper, we propose a training-free prompt tracking framework for interactive video object segmentation (I-PT) that leverages the powerful generalization of SAM. Although point tracking efficiently captures pixel-wise information about objects in a video, points tend to become unstable when tracked over a long period, resulting in incorrect segmentation. For fast and robust interaction, we jointly track sparse points and boxes, filtering out unstable points and capturing object-wise information. To better integrate reference information from multiple interactions, we introduce a cross-round space-time module (CRSTM), which adaptively aggregates mask features from previous rounds and frames, enhancing segmentation stability. Our framework demonstrates robust zero-shot video segmentation results on popular VOS datasets with interaction types, including DAVIS 2017, YouTube-VOS 2018, and MOSE 2023, while maintaining a good tradeoff between performance and interaction time.
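
The abstract does not say how unstable points are filtered, so the following is only a minimal sketch of one plausible reading: a tracked point is kept as a prompt when the tracker is reasonably confident in it and it still lies inside the tracked box. The `TrackedPoint` type, the `confidence` field, the `min_confidence` threshold, and the prompt dictionary layout are all hypothetical and are not taken from the paper or from any library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in pixel coordinates
Box = Tuple[float, float, float, float]


@dataclass
class TrackedPoint:
    x: float
    y: float
    confidence: float  # hypothetical per-point tracker score in [0, 1]


def filter_stable_points(
    points: List[TrackedPoint], box: Box, min_confidence: float = 0.5
) -> List[TrackedPoint]:
    """Keep points that are confident AND inside the tracked box.

    Assumed criterion for illustration only; the paper's actual rule
    for discarding unstable points may differ.
    """
    x_min, y_min, x_max, y_max = box
    return [
        p
        for p in points
        if p.confidence >= min_confidence
        and x_min <= p.x <= x_max
        and y_min <= p.y <= y_max
    ]


def build_prompts(points: List[TrackedPoint], box: Box) -> Dict[str, object]:
    """Bundle the surviving points and the box as prompts for one frame."""
    stable = filter_stable_points(points, box)
    return {"points": [(p.x, p.y) for p in stable], "box": box}


# One frame: the second point drifted outside the box, the third has low confidence.
frame_points = [
    TrackedPoint(50.0, 60.0, 0.9),
    TrackedPoint(300.0, 40.0, 0.95),
    TrackedPoint(70.0, 80.0, 0.2),
]
frame_box = (20.0, 30.0, 120.0, 140.0)
prompts = build_prompts(frame_points, frame_box)
```

In this sketch the box supplies the object-wise constraint and the points supply pixel-wise detail, so a point that drifts onto the background is dropped before it can mislead a promptable segmenter such as SAM.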

