Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
This work aims to reduce annotation costs in video instance segmentation, though it appears incremental because it builds on existing unsupervised methods such as VideoCutLER.
The paper tackles the annotation challenges in Video Instance Segmentation by proposing AutoQ-VIS, an unsupervised framework that uses quality-guided self-training to bridge the synthetic-to-real domain gap. It reports state-of-the-art performance of 52.6 AP50 on YouTubeVIS-2019, a 4.4% improvement over the previous state of the art, VideoCutLER.
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
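The closed loop between pseudo-label generation and quality assessment described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PseudoLabel`, `select_by_quality`, and `self_train`, the fixed threshold of 0.75, the round count, and the stopping rule are all assumptions made for illustration, and the `predict` and `retrain` callables stand in for a real VIS model and its quality predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class PseudoLabel:
    """A predicted instance with an automatically estimated quality score (hypothetical)."""
    video_id: str
    instance_id: int
    quality: float  # assumed to lie in [0, 1]; higher means more trustworthy


def select_by_quality(candidates: Sequence[PseudoLabel], threshold: float) -> List[PseudoLabel]:
    """Keep only pseudo-labels whose estimated quality reaches the threshold."""
    return [c for c in candidates if c.quality >= threshold]


def self_train(
    videos: Sequence[str],
    predict: Callable[[Sequence[str]], List[PseudoLabel]],
    retrain: Callable[[List[PseudoLabel]], None],
    threshold: float = 0.75,  # assumed value, not taken from the paper
    rounds: int = 3,          # assumed value, not taken from the paper
) -> List[PseudoLabel]:
    """Alternate pseudo-label generation and quality filtering for a few rounds."""
    training_set: List[PseudoLabel] = []
    seen = set()
    for _ in range(rounds):
        # 1. Generate pseudo-labels on unlabeled real videos with the current model.
        candidates = predict(videos)
        # 2. Keep high-quality labels that were not already added in an earlier round.
        fresh = [
            c for c in select_by_quality(candidates, threshold)
            if (c.video_id, c.instance_id) not in seen
        ]
        if not fresh:
            break  # assumed stopping rule: nothing new passed the quality filter
        seen.update((c.video_id, c.instance_id) for c in fresh)
        training_set.extend(fresh)
        # 3. Retrain on the accumulated set so the next round starts from a better model.
        retrain(training_set)
    return training_set
```

In this sketch the quality score acts as a gate: only labels judged reliable enter the training set, which is one plausible way to limit error accumulation when moving from synthetic to real videos.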