CV Oct 10, 2023

Self-supervised Object-Centric Learning for Videos

arXiv:2310.06907v1 22.052 citations h-index: 43

Originality: Incremental advance

AI Analysis

The paper addresses unsupervised segmentation of multiple objects in real-world video sequences. The advance is rated incremental because it builds on prior work, but it targets challenging real-world scenarios.

The paper tackles unsupervised multi-object segmentation in real-world videos with a fully unsupervised object-centric learning framework that spatially binds objects to slots on each frame and relates those slots across frames. The method segments multiple instances of complex, high-variety classes in YouTube videos.

Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
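Two steps named in the abstract, dropping tokens in feature space and merging similar slots, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`drop_tokens`, `merge_slots`), the uniform random dropping, the cosine-similarity threshold, and the greedy pairwise merging rule are hypothetical and are not taken from the paper, which only states that tokens are dropped and that slots are merged based on similarity.

```python
import math
import random


def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a fraction of feature tokens (illustrative masking only)."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_slots(slots, threshold):
    """Greedily merge the most similar pair of slot groups until no pair of
    group means has cosine similarity >= threshold (assumed rule)."""
    groups = [[list(s)] for s in slots]
    while len(groups) > 1:
        centers = [mean_vector(g) for g in groups]
        sim, i, j = max(
            (cosine(centers[i], centers[j]), i, j)
            for i in range(len(centers))
            for j in range(i + 1, len(centers))
        )
        if sim < threshold:
            break
        groups[i].extend(groups.pop(j))  # j > i, so index i stays valid
    return [mean_vector(g) for g in groups]


# Toy usage: 8 two-dimensional tokens, 3 slots of which two are near-duplicates.
rng = random.Random(0)
tokens = [[float(i), float(i + 1)] for i in range(8)]
kept, kept_idx = drop_tokens(tokens, keep_ratio=0.25, rng=rng)
merged = merge_slots([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.9)
```

In this toy setup the two near-parallel slots collapse into one averaged slot while the orthogonal slot is left alone, which is the intuition behind reducing over-clustering. The threshold and keep ratio are placeholder values, not settings reported by the authors.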

