Compositional Visual Generation and Inference with Energy Based Models
This paper addresses compositional reasoning in computer vision, proposing energy-based models as a way to combine visual concepts.
The paper tackles compositional visual generation and inference by showing that energy-based models can directly combine probability distributions to compose concepts, which enables generating images that satisfy conjunctions, disjunctions, and negations of concepts. The authors evaluate the approach on CelebA faces and synthetic 3D scene images, and demonstrate further capabilities such as continually learning new concepts and inferring the concept properties underlying an image.
A vital aspect of human intelligence is the ability to compose increasingly complex concepts out of simpler ideas, enabling both rapid learning and adaptation of knowledge. In this paper we show that energy-based models can exhibit this ability by directly combining probability distributions. Samples from the combined distribution correspond to compositions of concepts. For example, given a distribution for smiling faces, and another for male faces, we can combine them to generate smiling male faces. This allows us to generate natural images that simultaneously satisfy conjunctions, disjunctions, and negations of concepts. We evaluate compositional generation abilities of our model on the CelebA dataset of natural faces and synthetic 3D scene images. We also demonstrate other unique advantages of our model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image.
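To make the idea of combining distributions concrete, here is a minimal one-dimensional sketch. It is not taken from the paper. It assumes each concept is a density p(x) ∝ exp(-E(x)) and uses one common formulation of the three operators: conjunction as a sum of energies (a product of densities), disjunction as a negative log-sum-exp of negated energies (a sum of densities), and negation as subtracting a down-weighted energy. The two Gaussian "concepts", the weight `alpha`, and all function names are hypothetical and chosen for illustration only.

```python
import math

SIGMA = 0.5  # assumed width shared by both toy concepts


def energy(x, mu, sigma=SIGMA):
    """Quadratic energy, so exp(-energy) is an unnormalized Gaussian at mu."""
    return (x - mu) ** 2 / (2.0 * sigma ** 2)


def e_a(x):
    return energy(x, -1.0)  # hypothetical concept A, centred at x = -1


def e_b(x):
    return energy(x, 1.0)   # hypothetical concept B, centred at x = +1


def and_energy(x):
    # Conjunction: product of densities, i.e. sum of energies.
    return e_a(x) + e_b(x)


def or_energy(x):
    # Disjunction: sum of densities, i.e. -log(exp(-E_a) + exp(-E_b)),
    # computed in a numerically stable way.
    a, b = e_a(x), e_b(x)
    m = min(a, b)
    return m - math.log(math.exp(m - a) + math.exp(m - b))


def not_energy(x, alpha=0.5):
    # "B and not A": subtract a down-weighted energy of A.
    # alpha < 1 is assumed here so the toy density stays normalizable.
    return e_b(x) - alpha * e_a(x)


def mode(energy_fn, lo=-6.0, hi=8.0, steps=2800):
    """Grid search for the lowest-energy (highest-density) point."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=energy_fn)


if __name__ == "__main__":
    print("conjunction mode:", mode(and_energy))
    print("negation mode:", mode(not_energy))
    print("disjunction energy at 0 and 1:", or_energy(0.0), or_energy(1.0))
```

In this toy setting the conjunction concentrates mass where both concepts overlap, the disjunction keeps a low-energy region around each concept, and the negation pushes the mode of B away from A. For images, the same combined energy would be sampled with an MCMC procedure rather than a grid search.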