Learned Equivariant Rendering without Transformation Supervision
This work addresses unsupervised object discovery and scene manipulation for computer vision applications, but it appears incremental because it builds on existing ideas from equivariance and self-supervised learning.
The paper tackles the problem of learning scene representations from video without supervision. It automatically separates objects from background by leveraging equivariance to transformations, and it demonstrates real-time manipulation and rendering of unseen combinations on moving MNIST with backgrounds.
We propose a self-supervised framework to learn scene representations from video that are automatically delineated into objects and background. Our method relies on moving objects being equivariant with respect to their transformation across frames and the background being constant. After training, we can manipulate and render the scenes in real time to create unseen combinations of objects, transformations, and backgrounds. We show results on moving MNIST with backgrounds.
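The two assumptions stated above (objects move with their transformation, the background stays fixed) can be illustrated with a toy compositing sketch. This is not the proposed model: it uses a hand-written circular translation and a hard paste in place of learned components, and the names `translate`, `compose`, `bg`, `obj`, and `new_bg` are hypothetical, introduced only for illustration.

```python
import numpy as np

def translate(layer, dy, dx):
    """Circularly shift a 2-D layer by (dy, dx) pixels (toy transformation)."""
    return np.roll(layer, shift=(dy, dx), axis=(0, 1))

def compose(obj, bg):
    """Paste non-zero object pixels over the background (toy renderer)."""
    return np.where(obj > 0, obj, bg)

rng = np.random.default_rng(0)
bg = rng.uniform(0.1, 0.5, size=(8, 8))   # static background layer
obj = np.zeros((8, 8))
obj[1:3, 1:3] = 1.0                        # 2x2 stand-in for a digit

# Two consecutive frames: only the object layer is transformed.
frame_t = compose(obj, bg)
moved = translate(obj, 2, 3)
frame_t1 = compose(moved, bg)

# Object pixels follow the transformation; background pixels are unchanged.
mask_t1 = moved > 0
object_follows_transform = np.array_equal(frame_t1[mask_t1], moved[mask_t1])
background_is_constant = np.allclose(frame_t1[~mask_t1], bg[~mask_t1])

# An "unseen combination": the same moved object over a different background.
new_bg = rng.uniform(0.6, 0.9, size=(8, 8))
recombined = compose(moved, new_bg)
```

Because the object and background are kept as separate layers, swapping either one or applying a new transformation only requires re-running `compose`, which is the kind of recombination the abstract describes at a much larger scale.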