TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
This addresses the need for more transparent and controllable AI systems in vision-language tasks, offering a novel method for user intervention, though it is incremental in its application to existing architectures.
The paper tackles the problem of interpretability and user intervention in vision-language models by introducing a Transformer Attention Bottleneck (TAB) layer, which constrains attention to enable debugging and editing, resulting in models that perform similarly to baselines in captioning but are superior in localizing changes and identifying no-change scenarios across three datasets.
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to $\in [0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image-difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to debug by editing attention, which often produces expected outputs by VLMs.