Appearance-Based Refinement for Object-Centric Motion Segmentation
The paper addresses the challenge of segmenting and tracking moving objects in complex videos, with applications such as video analysis. The contribution is incremental in that it refines existing flow-based approaches rather than replacing them.
The paper tackles imperfect motion segmentation from optical flow. It introduces an appearance-based refinement method that uses temporal consistency to correct flow-based proposals. It reports competitive performance on single-object segmentation and significantly outperforms existing models on multi-object segmentation on benchmarks including DAVIS and YouTubeVOS.
The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. Previous approaches have explored the use of optical flow for motion segmentation, leading to imperfect predictions due to partial motion, background distraction, and object articulations and interactions. To address this issue, we introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars, and an object-centric architecture that refines problematic masks based on exemplar information. The model is pre-trained on synthetic data and then adapted to real-world videos in a self-supervised manner, eliminating the need for human annotations. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59. We achieve competitive performance on single-object segmentation, while significantly outperforming existing models on the more challenging problem of multi-object segmentation. Finally, we investigate the benefits of using our model as a prompt for the per-frame Segment Anything Model.
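To make the idea of sequence-level exemplar selection concrete, the sketch below ranks per-frame flow-predicted masks by how well each agrees with its temporal neighbours and keeps the top few as exemplars. This is an illustrative assumption, not the paper's selection mechanism: the abstract does not specify how exemplars are scored, and the names `mask_iou` and `select_exemplars`, the neighbour-IoU score, and the parameter `k` are all hypothetical. Adjacent-frame IoU also ignores object motion between frames, so it is only a crude proxy for temporal consistency.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-union of two binary masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as agreeing
    return float(np.logical_and(a, b).sum() / union)


def select_exemplars(masks, k=2):
    """Pick k frames whose masks agree best with their temporal neighbours.

    Hypothetical scoring rule (an assumption, not the paper's method):
    the score of frame t is the mean IoU with frames t-1 and t+1.
    Returns the sorted exemplar indices and the per-frame scores.
    """
    n = len(masks)
    scores = []
    for t in range(n):
        neighbours = [i for i in (t - 1, t + 1) if 0 <= i < n]
        if not neighbours:
            scores.append(0.0)  # a single-frame sequence has no evidence
            continue
        ious = [mask_iou(masks[t], masks[i]) for i in neighbours]
        scores.append(float(np.mean(ious)))
    # Stable sort on negated scores: highest first, ties broken by frame order.
    order = np.argsort(-np.asarray(scores), kind="stable")
    return sorted(order[:k].tolist()), scores


if __name__ == "__main__":
    square = np.zeros((4, 4), dtype=bool)
    square[1:3, 1:3] = True
    empty = np.zeros((4, 4), dtype=bool)  # stands in for a failed flow mask
    exemplars, scores = select_exemplars([square, square, square, empty], k=2)
    print(exemplars, scores)
```

In a pipeline like the one the abstract describes, the selected frames would then supply appearance information for refining the remaining, lower-scoring masks; that refinement step is a learned model in the paper and is not sketched here.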