CV · Mar 19, 2025

Object-Centric Pretraining via Target Encoder Bootstrapping

Nikola Đukić, Tim Lebailly, Tinne Tuytelaars

arXiv:2503.15141v111.84 citations · h-index: 76 · Has Code · ICLR

Originality Highly original

AI Analysis

OCEBO enables object-centric representation learning without relying on frozen pretrained foundation models, addressing a bottleneck for researchers in unsupervised computer vision.

The paper tackles the problem of training object-centric models from scratch on real-world data. It proposes OCEBO, a self-distillation method that updates the target encoder as an exponential moving average of the object-centric model. Trained on 241k COCO images, OCEBO achieves unsupervised object discovery performance comparable to object-centric models whose frozen target encoders were pretrained on hundreds of millions of images.

Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, the targets must remain frozen throughout training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping (OCEBO), a self-distillation setup that, for the first time, trains object-centric models from scratch on real-world data. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, so it is explicitly enriched with the object-centric inductive biases introduced by slot attention, and the upper bound on performance present in other models is removed. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
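The exponential moving average update mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ema_update`, the representation of parameters as a dict of float lists, and the momentum value are all assumptions chosen only to show the general rule, target ← m · target + (1 − m) · online.

```python
def ema_update(target, online, momentum=0.99):
    """Move each target parameter toward its online counterpart.

    `target` and `online` map parameter names to lists of floats.
    The names and the default momentum are illustrative assumptions.
    """
    for name, online_values in online.items():
        target[name] = [
            momentum * t + (1.0 - momentum) * o
            for t, o in zip(target[name], online_values)
        ]
    return target


# Hypothetical toy parameters: the target starts at zero and drifts
# slowly toward the online (object-centric) model's weights.
online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
ema_update(target, online, momentum=0.9)
```

With a momentum close to 1 the target changes slowly, which is the usual reason self-distillation setups use this kind of update; the value actually used for OCEBO is not stated in the abstract.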
