GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
This work addresses the problem of decomposing complex real-world images into object representations that AI systems can use. It is an incremental advance in slot-attention methods.
The paper tackles object-centric learning on real-world datasets with complex scenes. It introduces GLASS, a novel slot-attention model that uses guided diffusion to improve slot embeddings, and it reports state-of-the-art performance on tasks such as object discovery and conditional image generation.
Object-centric learning aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable a variety of downstream tasks. Yet, object-centric learning struggles on real-world datasets, which contain multiple objects with complex textures and shapes in natural everyday scenes. To address this, we introduce Guided Latent Slot Diffusion (GLASS), a novel slot-attention model that learns in the space of generated images and uses semantic and instance guidance modules to learn better slot embeddings for various downstream tasks. Our experiments show that GLASS surpasses state-of-the-art slot-attention methods by a wide margin on tasks such as (zero-shot) object discovery and conditional image generation for real-world scenes. Moreover, GLASS enables the first application of slot attention to the compositional generation of complex, realistic scenes.
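To make the idea of decomposing an image into slots concrete, the following is a minimal sketch of a generic slot-attention update, not the GLASS model itself. It assumes a heavily simplified variant with no learned projections, recurrent update, or guidance modules. The function name `slot_attention_step`, the toy feature sizes, and the random inputs are illustrative assumptions rather than anything taken from the paper.

```python
import math
import random


def slot_attention_step(slots, features, eps=1e-8):
    """One simplified slot-attention update (assumption: no learned layers).

    slots:    K lists of length D, the current slot vectors
    features: N lists of length D, flattened image features
    Returns (new_slots, attn), where attn[n][k] is the share of feature n
    claimed by slot k.
    """
    d = len(slots[0])
    attn = []
    for f in features:
        # Scaled dot-product similarity between this feature and every slot.
        logits = [sum(a * b for a, b in zip(f, s)) / math.sqrt(d) for s in slots]
        # Softmax over the slot axis: slots compete for each feature.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        attn.append([e / z for e in exps])

    new_slots = []
    for k in range(len(slots)):
        # Normalize per slot so the update is a weighted mean of the features.
        total = sum(row[k] for row in attn) + eps
        new_slots.append([
            sum(attn[n][k] * features[n][j] for n in range(len(features))) / total
            for j in range(d)
        ])
    return new_slots, attn


# Toy input: an 8x8 "feature map" with 16 channels and 4 random slots.
random.seed(0)
features = [[random.gauss(0, 1) for _ in range(16)] for _ in range(64)]
slots = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]

for _ in range(3):  # a few refinement iterations
    slots, attn = slot_attention_step(slots, features)

# Hard assignment of each spatial location to its highest-attention slot.
masks = [max(range(len(row)), key=row.__getitem__) for row in attn]
```

In this sketch, `masks` plays the role of an object-discovery output: each of the 64 locations is assigned to one of the 4 slots. A trained model would compute the features with an encoder and would learn the projections that this simplified version omits.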