Design Pseudo Ground Truth with Motion Cue for Unsupervised Video Object Segmentation
This work addresses the need for efficient unsupervised video object segmentation. It is more efficient than existing methods, though the contribution appears incremental because it builds on existing instance segmentation networks and motion cues.
The paper tackles the problem of labeling object masks for training in video object segmentation by proposing a method to generate inexpensive, high-quality pseudo ground truth using motion cues, which improves segmentation performance and outperforms state-of-the-art unsupervised methods on datasets like DAVIS and FBMS.
One major technical debt in video object segmentation is the labeling of object masks for training instances. To address it, we propose to prepare inexpensive yet high-quality pseudo ground truth, corrected with motion cues, for video object segmentation training. Our method performs semantic segmentation with instance segmentation networks and then selects the segmented object of interest as the pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune the pretrained objectness network to facilitate object segmentation in the remaining frames of the video. We show that the pseudo ground truth effectively improves segmentation performance. This straightforward unsupervised video object segmentation method is more efficient than existing methods. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods. Moreover, the category-agnostic pseudo ground truth has great potential to extend to tracking multiple arbitrary objects.
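The selection step described above can be illustrated with a minimal sketch. The abstract does not state the exact selection criterion, so the rule used here (choose the candidate mask whose mean optical-flow magnitude exceeds that of its surroundings by the largest margin) is an assumption made for illustration, as are the function name `select_pseudo_ground_truth` and the toy inputs. In practice the candidate masks would come from an instance segmentation network and the flow from an optical-flow estimator.

```python
import numpy as np

def select_pseudo_ground_truth(instance_masks, flow):
    """Return the index of the candidate mask that moves most relative to
    the rest of the frame, or None if no usable candidate exists.

    instance_masks: list of (H, W) boolean arrays (hypothetical network output).
    flow: (H, W, 2) array of per-pixel optical flow vectors.
    NOTE: the scoring rule below is an assumed criterion, not the paper's.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # per-pixel motion strength, (H, W)
    best_index, best_score = None, -np.inf
    for index, mask in enumerate(instance_masks):
        mask = mask.astype(bool)
        if not mask.any() or mask.all():
            continue  # skip empty or full-frame masks: no contrast to measure
        # Motion contrast: mean flow inside the mask minus mean flow outside.
        score = magnitude[mask].mean() - magnitude[~mask].mean()
        if score > best_score:
            best_index, best_score = index, score
    return best_index

# Toy frame: one 2x2 region moves with flow (3, 4); everything else is static.
flow = np.zeros((4, 6, 2))
flow[1:3, 1:3] = [3.0, 4.0]

moving = np.zeros((4, 6), dtype=bool)
moving[1:3, 1:3] = True
static = np.zeros((4, 6), dtype=bool)
static[0:2, 4:6] = True

chosen = select_pseudo_ground_truth([static, moving], flow)
```

In this toy case `chosen` indexes the moving mask, which would then serve as the pseudo ground truth for fine-tuning. A real pipeline would likely need a more robust criterion, for example one that compensates for camera motion.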