Finding Distributed Object-Centric Properties in Self-Supervised Transformers

arXiv:2603.2612775.81 citations · h-index: 19
Predicted impact top 34% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses object localization and hallucination issues in vision and multimodal AI systems, offering an incremental improvement over existing methods.

The paper tackles the problem of poor object localization in self-supervised Vision Transformers by analyzing patch-level interactions, and introduces Object-DINO, a training-free method that improves unsupervised object discovery with gains of +3.6 to +12.4 CorLoc and mitigates object hallucination in Multimodal Large Language Models.
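The gains above are reported in CorLoc. As a rough illustration, the sketch below follows the definition commonly used in unsupervised object discovery: the percentage of images whose single predicted box overlaps some ground-truth box with IoU above a threshold. The 0.5 threshold, the `(x1, y1, x2, y2)` box format, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not stated here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predicted_boxes, ground_truth_boxes, threshold=0.5):
    """Percentage of images whose predicted box matches any ground-truth box.

    predicted_boxes: one box per image.
    ground_truth_boxes: a list of boxes per image.
    threshold: assumed IoU cutoff (0.5 is the usual choice).
    """
    hits = sum(
        1
        for pred, gts in zip(predicted_boxes, ground_truth_boxes)
        if any(iou(pred, gt) > threshold for gt in gts)
    )
    return 100.0 * hits / len(predicted_boxes)


# Toy data: the first image is localized correctly, the second is missed.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(preds, gts)
```

Under this definition, a gain of +3.6 CorLoc would mean that 3.6 percentage points more images receive a correctly localized box.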

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
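The analysis described in the abstract, inter-patch similarity computed from each head's query, key, and value features, followed by clustering of heads, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity, the concatenated per-head descriptor, the plain k-means with farthest-point initialization, and all function names are choices made here for the example. How Object-DINO automatically picks the object-centric cluster is not specified in the abstract, so that step is left out.

```python
import math


def cosine_similarity_map(patch_feats):
    """N x N cosine similarity between N patch feature vectors."""
    normed = []
    for vec in patch_feats:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]


def head_descriptor(q, k, v):
    """Flatten and concatenate the q, k, and v similarity maps of one head."""
    desc = []
    for feats in (q, k, v):
        for row in cosine_similarity_map(feats):
            desc.extend(row)
    return desc


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < n_clusters:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iters):
        labels = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy heads over 4 patches (hypothetical features, not real DINO outputs).
# "grouped" heads split the patches into two groups; "diffuse" heads do not.
grouped = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
diffuse = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
heads = [grouped, diffuse, grouped, diffuse]  # would span all layers in practice

descriptors = [head_descriptor(h, h, h) for h in heads]
labels = kmeans(descriptors, n_clusters=2)
```

In a real setting the per-head features would come from every layer of the ViT, so heads from early and late layers can land in the same cluster, which reflects the abstract's point that object-centric information is distributed across the network.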

