CV · Mar 23, 2023

Top-Down Visual Attention from Analysis by Synthesis

arXiv:2303.13043v2 · 42 citations · h-index: 156
Originality: Incremental advance
AI Analysis

The paper addresses the need for task-adaptive attention in intelligent agents. Its approach is an incremental advance that combines existing concepts.

The paper addresses the limitation that attention in current vision models is purely stimulus-driven. It proposes a top-down attention mechanism based on Analysis-by-Synthesis that improves performance on vision-language tasks such as VQA and zero-shot retrieval, as well as on classification, semantic segmentation, and robustness.

Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
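
As a toy illustration of how a goal-directed top-down signal can modulate which tokens are emphasized, the sketch below scales each token by its clamped cosine similarity to a prior vector (for example, a language embedding of the task). The function names, the clamping rule, and the 2-D vectors are assumptions made for illustration; this is not AbSViT's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_down_reweight(tokens, prior):
    """Scale each token by max(cos(token, prior), 0).

    Hypothetical helper: tokens aligned with the prior are kept,
    while orthogonal or opposed tokens are suppressed to zero.
    """
    weights = [max(cosine(t, prior), 0.0) for t in tokens]
    modulated = [[w * x for x in t] for t, w in zip(tokens, weights)]
    return weights, modulated


# Three toy "image tokens" and a prior pointing along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prior = [1.0, 0.0]
weights, modulated = top_down_reweight(tokens, prior)
print(weights)
```

Only the first token is aligned with the prior, so it alone survives the reweighting. Changing `prior` changes which tokens are kept, which is the sense in which the attention is controllable by the task.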

Code Implementations: 1 repo
