CV | Apr 10, 2025

Learning Object Focused Attention

arXiv:2504.08166v1 | h-index: 58 | ICPR
Originality: Incremental advance
AI Analysis

For computer vision researchers, this work addresses the lack of explicit object modeling in ViTs, offering an incremental improvement with no inference overhead.

The authors tackle the lack of explicit object modeling in Vision Transformers (ViTs) by proposing an object-focused attention (OFA) loss that restricts attention to patches belonging to the same object. The reported results are improved classification, stronger generalization to out-of-distribution and adversarially corrupted images, and representations based on object shapes rather than textures.

We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
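The abstract describes the OFA loss only in words, so the following is a minimal sketch of one plausible form, not the authors' formulation. It assumes a single attention map whose rows sum to 1 and a per-patch class label, and it penalizes attention mass that falls outside the query patch's own object class. The names `ofa_loss_sketch`, `patch_labels`, and `eps`, and the negative-log form of the penalty, are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ofa_loss_sketch(attn, patch_labels, eps=1e-8):
    """Hypothetical auxiliary loss; the paper's exact form may differ.

    attn:         (N, N) attention map, each row sums to 1.
    patch_labels: (N,) integer object-class id for every patch.
    Returns the mean negative log of the attention mass that each
    query patch places on patches of its own object class.
    """
    # same_object[i, j] is True when patches i and j share a class.
    same_object = patch_labels[:, None] == patch_labels[None, :]
    # Attention mass each query keeps inside its own object.
    intra_mass = (attn * same_object).sum(axis=-1)
    return float(-np.log(intra_mass + eps).mean())


# Toy example: four patches, two per object class.
labels = np.array([0, 0, 1, 1])

# Uniform attention spreads half of its mass onto the other object.
uniform_attn = softmax(np.zeros((4, 4)))

# Focused attention has high logits only for same-object pairs.
focused_logits = np.where(labels[:, None] == labels[None, :], 5.0, -5.0)
focused_attn = softmax(focused_logits)

loss_uniform = ofa_loss_sketch(uniform_attn, labels)
loss_focused = ofa_loss_sketch(focused_attn, labels)
print(f"uniform: {loss_uniform:.4f}, focused: {loss_focused:.6f}")
```

In a training loop, a term like this would presumably be added to the classification loss with some weighting factor for the selected attention layers. Because it only reads the attention maps and adds no parameters to the forward path, it can be dropped at test time, which is consistent with the claim of no additional inference overhead.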

