CV

Sep 15, 2024

Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder

arXiv:2409.09716v15.22 citations

h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses scene interpretation in computer vision. The advance appears incremental because it builds on existing neurosymbolic and autoencoder methods.

The authors observe that deep learning architectures lack principled ways to capture visual concepts such as objects and geometric transforms. To address this, they propose a neurosymbolic architecture that uses a domain-specific language to learn scene interpretations. On a synthetic benchmark, they demonstrate that it can disentangle aspects of image formation, learn from small data, infer correctly in the presence of noise, and generalize out of sample.

Contemporary deep learning architectures lack principled means for capturing and handling fundamental visual concepts, like objects, shapes, geometric transforms, and other higher-level structures. We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation, including object shape, appearance, categorization, and geometric transforms. We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network. When executed, the parameterized program produces geometric primitives which are rendered and assessed for correspondence with the scene content and trained via auto-association with gradient. We confront our approach with a baseline method on a synthetic benchmark and demonstrate its capacity to disentangle selected aspects of the image formation process, learn from small data, infer correctly in the presence of noise, and generalize out of sample.
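The render-and-compare loop described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "template program" is reduced to a single soft-edged disk with parameters (cx, cy, r), the convolutional encoder is omitted so the parameters are optimized directly for one image, and central finite differences stand in for backpropagation. The names `render_disk`, `loss`, `grad`, and `fit` are hypothetical.

```python
import math

SIZE = 16  # image side length in pixels (illustrative choice)


def render_disk(cx, cy, r, sharpness=2.0):
    """Render a soft-edged disk; the sigmoid edge keeps the image smooth in (cx, cy, r)."""
    img = []
    for y in range(SIZE):
        row = []
        for x in range(SIZE):
            d = math.hypot(x - cx, y - cy)
            row.append(1.0 / (1.0 + math.exp(-sharpness * (r - d))))
        img.append(row)
    return img


def mse(a, b):
    """Mean squared error between two SIZE x SIZE images."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
    return total / (SIZE * SIZE)


def loss(params, target):
    """Reconstruction loss: render the parameterized primitive and compare to the scene."""
    return mse(render_disk(*params), target)


def grad(params, target, eps=1e-4):
    """Central finite differences (an assumption; a real system would backpropagate)."""
    g = []
    for i in range(len(params)):
        hi = list(params)
        lo = list(params)
        hi[i] += eps
        lo[i] -= eps
        g.append((loss(hi, target) - loss(lo, target)) / (2 * eps))
    return g


def fit(target, init, lr=5.0, steps=300):
    """Plain gradient descent on the primitive's parameters."""
    params = list(init)
    for _ in range(steps):
        g = grad(params, target)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return params


# The "scene" is itself a rendered disk, so the true parameters are known.
target = render_disk(9.0, 6.0, 4.0)
init = [7.0, 8.0, 3.0]
fitted = fit(target, init)
initial_loss = loss(init, target)
final_loss = loss(fitted, target)
print("fitted (cx, cy, r):", [round(p, 2) for p in fitted])
print("loss before/after:", initial_loss, final_loss)
```

Because the fitted values are the disk's position and radius rather than an opaque latent vector, they can be read directly as an interpretation of the scene, which is the sense in which a compositional autoencoder of this kind is disentangled by construction.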
