CV | CL | Aug 17, 2022

Understanding Attention for Vision-and-Language Tasks

arXiv:2208.08104v2582 citations | h-index: 26 | Has Code
Originality Synthesis-oriented
AI Analysis

This work addresses the need for better interpretability and performance in attention mechanisms for vision-and-language tasks, though it is incremental as it focuses on analysis rather than introducing a new method.

The paper tackles the problem of understanding how different attention alignment calculations bridge the semantic gap between visual and textual features in vision-and-language tasks, analyzing their interpretability and impact on model performance across tasks like visual question answering and text-to-image generation, with code made available.

The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism would be more (or less) interpretable, and which may impact the model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL
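
To make the phrase "attention alignment score calculation" concrete, the following is a minimal sketch of how a single textual token vector might be scored against a set of visual region vectors and turned into attention weights. The abstract does not list which calculation methods the paper compares, so the three used here (dot product, scaled dot product, cosine similarity) are common choices picked purely for illustration. The function names, the toy feature vectors, and the `method` parameter are all hypothetical and are not taken from the authors' repository.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_scores(regions, token, method="dot"):
    """Score each visual region vector against one textual token vector.

    The three methods are illustrative assumptions, not the paper's list.
    Cosine scoring assumes no vector is all zeros.
    """
    if method == "dot":
        return [dot(r, token) for r in regions]
    if method == "scaled_dot":
        scale = math.sqrt(len(token))
        return [dot(r, token) / scale for r in regions]
    if method == "cosine":
        token_norm = math.sqrt(dot(token, token))
        return [dot(r, token) / (math.sqrt(dot(r, r)) * token_norm)
                for r in regions]
    raise ValueError(f"unknown method: {method}")


def attention_weights(regions, token, method="dot"):
    """Normalise alignment scores into weights that sum to 1."""
    return softmax(alignment_scores(regions, token, method))


# Toy data: three visual regions and one textual token, all 4-dimensional.
regions = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
]
token = [1.0, 1.0, 0.0, 0.0]

for method in ("dot", "scaled_dot", "cosine"):
    weights = attention_weights(regions, token, method)
    print(method, [round(w, 3) for w in weights])
```

On this toy input all three methods rank the third region highest, but they spread the weight differently across regions. That difference in how sharply the weights concentrate is the kind of effect an analysis of score calculation methods would look at.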

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
