CV | AI | CL | Nov 8, 2023

Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

arXiv:2311.05043v1 | 4 citations | h-index: 20 | Has Code
Originality: Incremental advance
AI Analysis

The work gives VQA researchers human-understandable explanations of model behavior. It is an incremental advance because it builds on existing zero-shot and LLM techniques.

The paper tackles the problem of translating transformer attention patterns in Visual Question Answering (VQA) models into natural language without training, achieving state-of-the-art zero-shot performance on GQA-REX and VQA-X datasets.

Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer, as inputs. The LLM is guided to select tokens which describe the regions in the input image that the VQA model attended to. Crucially, we determine this similarity by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps), or language models. We evaluate this novel task on textual explanation datasets for VQA, giving state-of-the-art performances for the zero-shot setting on GQA-REX and VQA-X. Our code is available at: https://github.com/ExplainableML/ZS-A2T.
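The guided token selection described in the abstract can be illustrated with a toy sketch. This is not the ZS-A2T implementation: the abstract does not give the scoring rule, so the weighted sum of language-model log-probability and matching score below is an assumption, and `toy_lm`, `toy_match`, `ATTENDED`, `guided_step`, and `decode` are hypothetical stand-ins. In the real framework, the language model would be a pre-trained LLM and the matching score would come from the VQA model's text-image matching on the attended image regions.

```python
import math
from typing import Callable, Dict, List

# Hypothetical toy language model: next-token probabilities keyed by the last token.
TABLE: Dict[str, Dict[str, float]] = {
    "because": {"the": 0.9, "a": 0.1},
    "a": {"dog": 1.0},
    "the": {"sky": 0.5, "dog": 0.3, "car": 0.2},
    "sky": {"is": 0.8, ".": 0.2},
    "dog": {"is": 0.8, ".": 0.2},
    "car": {"is": 0.8, ".": 0.2},
    "is": {"blue": 0.5, "running": 0.3, "parked": 0.2},
    "blue": {".": 1.0},
    "running": {".": 1.0},
    "parked": {".": 1.0},
    ".": {".": 1.0},
}

# Hypothetical stand-in for "what the attended image regions show".
ATTENDED = {"dog", "running"}


def toy_lm(context: List[str]) -> Dict[str, float]:
    """Return log-probabilities for the next token given the last context token."""
    return {tok: math.log(p) for tok, p in TABLE[context[-1]].items()}


def toy_match(tokens: List[str]) -> float:
    """Toy matching score: number of tokens that describe the attended regions."""
    return float(sum(tok in ATTENDED for tok in tokens))


def guided_step(lm_logprobs: Dict[str, float],
                match_score: Callable[[List[str]], float],
                prefix: List[str], top_k: int = 3, weight: float = 1.0) -> str:
    """Pick the top-k LM candidate with the best combined score (assumed rule)."""
    candidates = sorted(lm_logprobs, key=lm_logprobs.get, reverse=True)[:top_k]
    return max(candidates,
               key=lambda tok: lm_logprobs[tok] + weight * match_score(prefix + [tok]))


def decode(next_logprobs: Callable[[List[str]], Dict[str, float]],
           match_score: Callable[[List[str]], float],
           prompt: List[str], weight: float = 1.0,
           max_len: int = 8, end_token: str = ".") -> List[str]:
    """Greedy guided decoding until the end token or max_len tokens."""
    out: List[str] = []
    for _ in range(max_len):
        tok = guided_step(next_logprobs(prompt + out), match_score, out, weight=weight)
        out.append(tok)
        if tok == end_token:
            break
    return out


print(" ".join(decode(toy_lm, toy_match, ["because"])))              # the dog is running .
print(" ".join(decode(toy_lm, toy_match, ["because"], weight=0.0)))  # the sky is blue .
```

With the guidance weight set to zero, the toy language model follows its own most likely continuation. With guidance enabled, the matching score steers it toward tokens that describe the attended content. Because the guiding score is passed in as a function, another source (for example attribution maps instead of attention) could be swapped in without touching the decoding loop, which mirrors the drop-in replacement the abstract mentions.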

Code Implementations: 1 repo