CV | LG | ML | Jan 22, 2019

MONet: Unsupervised Scene Decomposition and Representation

arXiv:1901.11390v1 | 605 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for models that can discover useful scene decompositions, which should improve data efficiency and transfer performance. The advance is incremental because it builds on existing VAE and attention methods.

The paper tackles unsupervised scene decomposition and shows that the MONet model can learn to decompose challenging 3D scenes into semantically meaningful components, such as objects and background elements, and to represent each of them.

The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
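The abstract says that a recurrent attention network provides masks around image regions and that a VAE reconstructs those regions. One way such masks can be produced step by step is sketched below, assuming a "scope" that tracks how much of each pixel is still unexplained. The scope update, the function names `decompose_scope` and `mix_components`, and the random stand-ins for network outputs are all illustrative assumptions; they are not taken from the abstract and are not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decompose_scope(alpha_logits):
    """Turn per-step attention logits into masks that sum to 1 per pixel.

    alpha_logits: list of K-1 lists, each holding one logit per pixel
    (hypothetical stand-ins for an attention network's outputs).
    Returns K masks, each a list with one weight per pixel.
    """
    num_pixels = len(alpha_logits[0])
    scope = [1.0] * num_pixels  # every pixel is unexplained at the start
    masks = []
    for logits in alpha_logits:
        alpha = [sigmoid(v) for v in logits]
        # Claim a fraction alpha of whatever is still unexplained.
        masks.append([s * a for s, a in zip(scope, alpha)])
        # Pass the remainder on to the next attention step.
        scope = [s * (1.0 - a) for s, a in zip(scope, alpha)]
    masks.append(scope)  # the final slot takes whatever is left
    return masks


def mix_components(masks, components):
    """Mask-weighted sum of per-slot reconstructions, pixel by pixel."""
    num_pixels = len(masks[0])
    return [
        sum(m[i] * c[i] for m, c in zip(masks, components))
        for i in range(num_pixels)
    ]


# Toy example: 4 slots over a flattened 6-pixel "image".
rng = random.Random(0)
num_slots, num_pixels = 4, 6
alpha_logits = [[rng.gauss(0.0, 2.0) for _ in range(num_pixels)]
                for _ in range(num_slots - 1)]
masks = decompose_scope(alpha_logits)

# Random values in [0, 1) stand in for the VAE's per-slot reconstructions.
components = [[rng.random() for _ in range(num_pixels)]
              for _ in range(num_slots)]
reconstruction = mix_components(masks, components)
```

Because each step only claims part of the remaining scope and the last slot absorbs the rest, the masks form a per-pixel partition of unity. The mixed reconstruction is therefore a convex combination of the slot reconstructions at every pixel.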

Code Implementations: 5 repos
