Tracking Anything with Decoupled Video Segmentation
This work addresses the high cost of annotating video data, a problem for researchers and practitioners in computer vision, and makes it cheaper to extend video segmentation to new tasks. The contribution is incremental in that it builds on existing segmentation methods.
The paper tackles the high cost of video segmentation annotation by proposing DEVA, a decoupled approach that combines task-specific image-level segmentation with a universal temporal propagation model. It achieves competitive performance in data-scarce tasks such as large-vocabulary video panoptic segmentation and open-world video segmentation.
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
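The decoupled design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `iou`, `fuse`, `track`, `segment_image`, `propagate`, `detect_every`, and `iou_thresh` are all hypothetical, masks are represented as plain sets of pixel coordinates, and the fusion step is a simple greedy IoU match standing in for the paper's fusion of segmentation hypotheses. The sketch runs forward only and omits the bi-directional part. It is meant only to show where the task-specific image model and the task-agnostic propagation model plug in.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]          # toy mask: a set of (row, col) pixels
Tracks = Dict[int, Mask]         # track ID -> current mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Tracks, detected: List[Mask], next_id: int,
         iou_thresh: float = 0.5) -> Tuple[Tracks, int]:
    """Greedily match image-level detections to propagated tracks.

    A matched detection replaces the propagated mask and keeps its ID,
    an unmatched detection starts a new track, and unmatched tracks are
    kept unchanged. (Assumed rule, chosen for illustration.)
    """
    fused = dict(propagated)
    unmatched = set(propagated)
    for det in detected:
        best_id, best_iou = None, iou_thresh
        for tid in sorted(unmatched):
            score = iou(propagated[tid], det)
            if score >= best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            fused[next_id] = det
            next_id += 1
        else:
            fused[best_id] = det
            unmatched.discard(best_id)
    return fused, next_id


def track(frames: Sequence[object],
          segment_image: Callable[[object], List[Mask]],
          propagate: Callable[[Tracks, object], Tracks],
          detect_every: int = 1) -> List[Tracks]:
    """Run the decoupled loop: propagate every frame, detect periodically."""
    tracks: Tracks = {}
    next_id = 0
    outputs: List[Tracks] = []
    for t, frame in enumerate(frames):
        # Task-agnostic step: carry existing masks into the current frame.
        tracks = propagate(tracks, frame)
        # Task-specific step: fold in fresh image-level segmentations.
        if t % detect_every == 0:
            tracks, next_id = fuse(tracks, segment_image(frame), next_id)
        outputs.append(dict(tracks))
    return outputs


# Toy usage: each "frame" is already a list of masks, the image model just
# returns it, and propagation is the identity (objects do not move).
toy_frames = [
    [frozenset({(0, 0), (0, 1)})],
    [frozenset({(0, 0), (0, 1), (0, 2)}), frozenset({(5, 5)})],
]
toy_outputs = track(toy_frames,
                    segment_image=lambda frame: frame,
                    propagate=lambda tracks, frame: tracks)
```

In this sketch, only `segment_image` would need to change when moving to a new task, while `propagate` would be shared across tasks, which mirrors the division of labor the abstract describes.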