UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model
UVOSAM addresses the limitation of existing UVOS methods, which require extensive mask annotations, and offers a more flexible approach for video analysis tasks.
The paper tackles unsupervised video object segmentation (UVOS) by proposing UVOSAM, a mask-free paradigm that prompts the Segment Anything Model (SAM) with boxes from the STD-Net tracker and, without mask supervision, outperforms existing mask-supervised methods on the DAVIS2017-unsupervised and YoutubeVIS datasets.
The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19&21 datasets demonstrate the superior performance of UVOSAM without mask supervision compared to existing mask-supervised methods, as well as its ability to generalize to weakly-annotated video datasets. Code can be found at https://github.com/alibaba/UVOSAM.
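The box-prompt data flow described in the abstract can be illustrated with a minimal sketch. It is not the released UVOSAM code: the names `segment_video` and `box_fill_segmenter` are hypothetical, the per-frame track boxes are hard-coded where a tracker such as STD-Net would supply them, and the segmenter is a trivial stand-in that fills each box where a box-prompted model such as SAM would return an object mask.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates
Segmenter = Callable[[np.ndarray, Box], np.ndarray]


def box_fill_segmenter(frame: np.ndarray, box: Box) -> np.ndarray:
    """Hypothetical stand-in for a box-prompted segmenter.

    A real model would predict the object's shape inside the box;
    this placeholder simply marks every pixel inside the box.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


def segment_video(
    frames: List[np.ndarray],
    tracks: List[Dict[int, Box]],
    segmenter: Segmenter,
) -> List[np.ndarray]:
    """Turn per-frame track boxes into per-frame label maps.

    tracks[t] maps a track id to its box in frame t (assumed to come
    from a tracker). Label 0 is background; where masks overlap, the
    track processed last overwrites earlier ones.
    """
    label_maps = []
    for frame, boxes in zip(frames, tracks):
        labels = np.zeros(frame.shape[:2], dtype=np.int32)
        for track_id, box in boxes.items():
            labels[segmenter(frame, box)] = track_id
        label_maps.append(labels)
    return label_maps


# Two blank 8x8 frames; a second object appears in the second frame.
frames = [np.zeros((8, 8, 3), dtype=np.uint8) for _ in range(2)]
tracks = [{1: (0, 0, 4, 4)}, {1: (1, 1, 5, 5), 2: (5, 5, 8, 8)}]
label_maps = segment_video(frames, tracks, box_fill_segmenter)
```

Because the track id is carried from the box to the mask, object identity across frames is decided entirely by the tracker, which is why the abstract emphasizes the quality of the box prompts.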