CV · LG · Mar 29, 2021

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

arXiv:2103.15679v1 · 467 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for interpretability in multi-modal AI systems, which is crucial for trust and debugging. It is incremental in that it extends explainability to more complex Transformer variants.

The authors address the problem of explaining predictions made by Transformer-based architectures, including bi-modal and encoder-decoder models. They propose what they describe as the first generic method for such architectures and report that it outperforms existing explainability methods adapted from single-modality settings.

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks, including object detection and image segmentation. Unlike Transformers that only use self-attention, Transformers with co-attention must consider multiple attention maps in parallel in order to highlight the information in the model's input that is relevant to the prediction. In this work, we propose the first method to explain predictions by any Transformer-based architecture, including bi-modal Transformers and Transformers with co-attention. We provide generic solutions and apply these to the three most commonly used of these architectures: (i) pure self-attention, (ii) self-attention combined with co-attention, and (iii) encoder-decoder attention. We show that our method is superior to all existing methods, which are adapted from single-modality explainability.
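
To make the idea of combining attention maps across layers concrete, here is a minimal sketch for the simplest case, pure self-attention. The abstract does not state an update rule, so everything below is an assumption for illustration: each layer's attention is weighted elementwise by its gradient, negative values are clipped, heads are averaged, and a relevancy matrix that starts as the identity is accumulated layer by layer. The function names (`head_averaged_positive`, `self_attention_relevancy`) are hypothetical and are not taken from the authors' code.

```python
# Illustrative sketch only. The update rule and every name here are
# assumptions for exposition, not the authors' implementation.

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def head_averaged_positive(attn_heads, grad_heads):
    """Average over heads of max(grad * attn, 0), taken elementwise."""
    num_heads = len(attn_heads)
    rows, cols = len(attn_heads[0]), len(attn_heads[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for attn, grad in zip(attn_heads, grad_heads):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += max(attn[i][j] * grad[i][j], 0.0) / num_heads
    return out

def self_attention_relevancy(layers):
    """Accumulate token-to-token relevancy over self-attention layers.

    layers: list of (attn_heads, grad_heads) pairs, one per layer, where
    each element is a list of n x n matrices (one matrix per head).
    """
    n = len(layers[0][0][0])
    relevancy = identity(n)  # assumed start: each token is relevant to itself
    for attn_heads, grad_heads in layers:
        a_bar = head_averaged_positive(attn_heads, grad_heads)
        update = matmul(a_bar, relevancy)
        relevancy = [
            [relevancy[i][j] + update[i][j] for j in range(n)] for i in range(n)
        ]
    return relevancy

# Toy input: one layer, one head, two tokens (values are made up).
toy_attn = [[[0.5, 0.5], [0.2, 0.8]]]
toy_grad = [[[1.0, -1.0], [1.0, 1.0]]]
toy_relevancy = self_attention_relevancy([(toy_attn, toy_grad)])
```

In a real model the attention maps and their gradients would come from a forward and backward pass for the predicted class. Co-attention and encoder-decoder models would need additional relevancy matrices for the cross-modal interactions, which this sketch does not attempt to cover.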

Code Implementations: 1 repo