Unsupervised Multi-object Segmentation Using Attention and Soft-argmax
This work addresses unsupervised object-centric learning for multi-object detection and segmentation. The contribution is incremental: it builds on existing methods and adds novel architectural components.
The paper tackles unsupervised multi-object segmentation with an architecture that uses attention and soft-argmax to predict object coordinates and features. A transformer encoder handles occlusions, and a convolutional autoencoder reconstructs the background. The architecture significantly outperforms the state of the art on complex synthetic benchmarks.
We introduce a new architecture for unsupervised object-centric representation learning and for multi-object detection and segmentation. It uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and to associate a feature vector with each object. A transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction. We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks.
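
To make the soft-argmax idea concrete, here is a minimal NumPy sketch of how a 2D score map can be turned into continuous object coordinates. It is an illustration under stated assumptions, not the paper's implementation: the function names `soft_argmax_2d` and `peaked_map`, the `temperature` parameter, and the use of one score map per object are all hypothetical choices made for this example.

```python
import numpy as np


def soft_argmax_2d(scores, temperature=1.0):
    """Return the expected (x, y) position under a softmax over a 2D score map.

    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w = scores.shape
    flat = scores.reshape(-1) / temperature
    flat = flat - flat.max()          # subtract the max for numerical stability
    weights = np.exp(flat)
    weights /= weights.sum()          # softmax over all pixel positions
    weights = weights.reshape(h, w)

    # Pixel coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # The soft-argmax is the attention-weighted average of the coordinates.
    x = float((weights * xs).sum())
    y = float((weights * ys).sum())
    return x, y


def peaked_map(h, w, cx, cy, height=50.0):
    """Build a toy score map with a single sharp peak at column cx, row cy."""
    scores = np.zeros((h, w))
    scores[cy, cx] = height
    return scores


scores = peaked_map(8, 8, cx=5, cy=2)
x, y = soft_argmax_2d(scores)

# Shifting the score map shifts the predicted coordinates by the same amount
# (as long as the peak does not wrap around the border).
shifted = np.roll(scores, shift=(1, 2), axis=(0, 1))
x_shifted, y_shifted = soft_argmax_2d(shifted)
```

In this toy setting, moving the peak by two columns and one row moves the output by the same offsets, which is the sense in which a coordinate predictor of this kind is translation-equivariant. Unlike a hard argmax, the weighted average is differentiable with respect to the scores, so it can be trained end to end.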