Interactive Video Object Segmentation in the Wild
This work addresses the need for efficient human-in-the-loop annotation in video object segmentation and offers a practical solution for computer vision practitioners. Its contribution is incremental, as it builds on existing interactive segmentation methods.
The paper tackles interactive video object segmentation by combining a one-shot video segmentation backbone with a deep interactive image segmentation method that needs only a few clicks. On the GrabCut dataset, the interactive method reaches 90% IOU with 3.8 clicks on average.
In this paper, we present our system for human-in-the-loop video object segmentation. The backbone of our system is a method for one-shot video object segmentation. While fast, this method requires an accurate pixel-level segmentation of one (or several) frames as input. As manually annotating such a segmentation is impractical, we propose a deep interactive image segmentation method that can accurately segment objects with only a handful of clicks. On the GrabCut dataset, our method obtains 90% IOU with just 3.8 clicks on average, setting a new state of the art. Furthermore, as our method iteratively refines an initial segmentation, it can effectively correct frames where the video object segmentation fails, thus allowing users to quickly obtain high-quality results even on challenging sequences. Finally, we investigate usage patterns, such as how many steps users take to annotate frames and what kinds of corrections they provide, yielding insights for further improving interactive video segmentation.
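To make the "clicks to reach 90% IOU" figure concrete, the following is a minimal sketch of how such a number could be computed. It is not the paper's evaluation code. The function names, the nested-list mask representation, and the cap of 20 clicks per object are assumptions made for illustration, and the values in the usage example are toy numbers, not results from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly; this also avoids division by zero.
    return 1.0 if union == 0 else inter / union


def clicks_to_reach(iou_per_click, threshold=0.9, max_clicks=20):
    """Index (1-based) of the first click at which IOU reaches the threshold.

    iou_per_click[i] is the IOU after click i + 1. If the threshold is never
    reached, the hypothetical cap max_clicks is returned.
    """
    for n, value in enumerate(iou_per_click[:max_clicks], start=1):
        if value >= threshold:
            return n
    return max_clicks


def mean_clicks(runs, threshold=0.9, max_clicks=20):
    """Average clicks-to-threshold over several objects."""
    counts = [clicks_to_reach(r, threshold, max_clicks) for r in runs]
    return sum(counts) / len(counts)


# Toy usage: two objects with made-up IOU curves.
toy_runs = [[0.50, 0.95], [0.91]]
print(mean_clicks(toy_runs))
```

Averaging the per-object click counts, rather than averaging IOU at a fixed number of clicks, is what yields a fractional figure such as 3.8 clicks.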