AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment
AutoQ-VIS addresses the cost of manual annotation in video instance segmentation research, though it builds incrementally on existing unsupervised approaches.
The paper tackles the annotation burden of Video Instance Segmentation (VIS) by introducing AutoQ-VIS, an unsupervised framework that uses quality-guided self-training to bridge the synthetic-to-real domain gap. It reports a state-of-the-art 52.6 AP50 on the YouTubeVIS-2019 val set, 4.4% above the previous best method, VideoCutLER, without using human annotations.
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments show state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4$\%$, while requiring no human annotations. This result demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.
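The closed loop between pseudo-label generation and automatic quality assessment can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PseudoLabel`, `self_training_loop`, `predict`, `assess`, and `retrain`, the fixed quality threshold, and the choice to rebuild the pseudo-label pool in every round are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class PseudoLabel:
    """One predicted instance track together with its estimated quality."""
    video_id: str
    track: Any      # stand-in for per-frame masks of one instance
    quality: float  # score from the (hypothetical) quality assessor


def self_training_loop(
    videos: Sequence[str],
    predict: Callable[[str], List[Any]],
    assess: Callable[[str, Any], float],
    retrain: Callable[[List[PseudoLabel]], None],
    rounds: int,
    threshold: float,
) -> List[PseudoLabel]:
    """Alternate pseudo-label generation, quality filtering, and retraining."""
    pool: List[PseudoLabel] = []
    for _ in range(rounds):
        # 1. Generate candidate pseudo-labels on unlabeled real videos.
        candidates = [
            PseudoLabel(video, track, assess(video, track))
            for video in videos
            for track in predict(video)
        ]
        # 2. Keep only candidates the assessor scores at or above the threshold.
        #    Assumption: the pool is rebuilt from scratch each round.
        pool = [c for c in candidates if c.quality >= threshold]
        # 3. Retrain the segmentation model on the accepted pseudo-labels,
        #    so the next round's predictions come from the adapted model.
        retrain(pool)
    return pool
```

In this sketch, `predict` stands in for a model initially trained on synthetic videos, and `retrain` is expected to update that model in place, which is what closes the loop from one round to the next.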