Segment Anything Across Shots: A Method and Benchmark
The work addresses a practical limitation of video segmentation in real-world applications by enabling cross-shot generalization, though the contribution is incremental because it builds on existing VOS methods.
This work tackles multi-shot semi-supervised video object segmentation (MVOS), a setting in which existing methods struggle with shot discontinuities. It proposes the SAAS model together with a transition mimicking augmentation strategy, and reports state-of-the-art performance on the YouMVOS and Cut-VOS benchmarks.
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, which limits their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization with single-shot data and thereby alleviates the severe scarcity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
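To make the idea of mimicking transitions with single-shot data more concrete, the following is a minimal sketch and not the authors' TMA implementation. It assumes, purely for illustration, that a hard cut can be approximated by abruptly changing the framing of a single-shot clip (a center crop resized back to full size, plus an optional mirror) from a chosen frame onward, with the same transform applied to the masks. The function name `mimic_cut` and its parameters are hypothetical.

```python
import numpy as np

def mimic_cut(frames, masks, cut_index, zoom=0.5, flip=True):
    """Simulate a hard cut at `cut_index` in a single-shot clip.

    Illustrative assumption, not the paper's method: frames from
    `cut_index` onward are center-cropped by `zoom`, resized back to the
    original size with nearest-neighbour sampling, and optionally mirrored.

    frames: (T, H, W, 3) array; masks: (T, H, W) array.
    """
    _, H, W = masks.shape
    crop_h, crop_w = max(1, int(H * zoom)), max(1, int(W * zoom))
    top, left = (H - crop_h) // 2, (W - crop_w) // 2

    # Nearest-neighbour source indices that stretch the crop back to (H, W).
    rows = top + (np.arange(H) * crop_h) // H
    cols = left + (np.arange(W) * crop_w) // W
    if flip:
        cols = cols[::-1]  # mirror horizontally

    out_frames, out_masks = frames.copy(), masks.copy()
    # Apply the identical geometric change to images and masks after the cut.
    out_frames[cut_index:] = frames[cut_index:][:, rows][:, :, cols]
    out_masks[cut_index:] = masks[cut_index:][:, rows][:, :, cols]
    return out_frames, out_masks
```

Frames before `cut_index` are left untouched, so a model trained on the output sees a continuous shot followed by an abrupt change in the target's scale and position, while the mask annotations stay aligned with the transformed frames.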