CV · LG · Mar 9, 2021

Self-Supervision by Prediction for Object Discovery in Videos

arXiv:2103.05669v1 · 7 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of reducing reliance on annotated data for video analysis, though it appears incremental within the broader self-supervised learning literature.

The paper tackles unsupervised object discovery in videos by proposing a self-supervised model that uses a prediction task to disentangle objects from motion dynamics. Initial experiments suggest it is a promising step toward object-centric video prediction without manual annotations.

Despite their irresistible success, deep learning algorithms still heavily rely on annotated data. On the other hand, unsupervised settings pose many challenges, especially about determining the right inductive bias in diverse scenarios. One scalable solution is to make the model generate the supervision for itself by leveraging some part of the input data, which is known as self-supervised learning. In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. In addition to disentangling the notion of objects and the motion dynamics, our compositional structure explicitly handles occlusion and inpaints inferred objects and background for the composition of the predicted frame. With the aid of auxiliary loss functions that promote spatially and temporally consistent object representations, our self-supervised framework can be trained without the help of any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
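The compositional step mentioned in the abstract (composing a predicted frame from inferred objects and background while handling occlusion) can be illustrated with a generic back-to-front alpha-compositing sketch. This is not the paper's implementation: the function `composite_frame`, the layer keys `appearance`, `mask`, and `depth`, and the use of a scalar depth per object are all assumptions made for illustration, and the abstract does not say how the model orders or blends its layers.

```python
from typing import Dict, List

# Grayscale image as rows of pixel intensities in [0, 1] (illustrative type).
Frame = List[List[float]]


def composite_frame(background: Frame, layers: List[Dict]) -> Frame:
    """Alpha-composite object layers over a background, back to front.

    Each layer is a dict with hypothetical keys (not taken from the paper):
      "appearance": Frame of predicted object pixels
      "mask":       Frame of per-pixel opacity in [0, 1]
      "depth":      float, larger means farther from the camera
    """
    # Copy the background so the caller's data is left untouched.
    frame = [row[:] for row in background]
    # Paint far objects first so that nearer objects occlude them.
    for layer in sorted(layers, key=lambda item: item["depth"], reverse=True):
        for y, row in enumerate(frame):
            for x, below in enumerate(row):
                alpha = layer["mask"][y][x]
                row[x] = alpha * layer["appearance"][y][x] + (1.0 - alpha) * below
    return frame


# Two overlapping objects on a black 2x3 background.
background = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
far_object = {
    "appearance": [[0.5] * 3, [0.5] * 3],
    "mask": [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    "depth": 2.0,
}
near_object = {
    "appearance": [[1.0] * 3, [1.0] * 3],
    "mask": [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
    "depth": 1.0,
}
predicted = composite_frame(background, [near_object, far_object])
```

In the middle pixel of the top row both masks are opaque, so the nearer object's value wins there. In this kind of layered formulation, the hidden part of the far object still needs an appearance value behind the occluder, which is one reason a model of this type would inpaint objects and background.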

