Measuring the Mixing of Contextual Information in the Transformer
This work addresses the need for better interpretability of Transformer models and is relevant mainly to NLP researchers and practitioners. The contribution is incremental, since it builds on existing attribution methods.
The paper tackles the problem of understanding how contextual information is mixed across Transformer layers. It proposes ALTI, a method that measures token-to-token interactions within each attention block and aggregates them across layers, yielding more faithful explanations and greater robustness than gradient-based methods.
The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block -- multi-head attention, residual connection, and layer normalization -- and define a metric to measure token-to-token interactions within each layer. Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions. Experimentally, we show that our method, ALTI (Aggregation of Layer-wise Token-to-token Interactions), provides more faithful explanations and greater robustness than gradient-based methods.
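
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the aggregation rule, so the sketch assumes each layer yields a row-normalized token-to-token matrix and that layers are combined by matrix multiplication, in the style of rollout methods. The function names `normalize_rows` and `aggregate_layerwise` and the toy scores are hypothetical.

```python
import numpy as np


def normalize_rows(scores):
    """Scale each row to sum to 1 so it reads as a distribution over tokens."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum(axis=1, keepdims=True)


def aggregate_layerwise(contributions):
    """Combine per-layer token-to-token matrices into input attributions.

    contributions: list of (n_tokens, n_tokens) arrays ordered from the first
    layer to the last. Entry [i, j] is assumed to be the share of token j's
    layer input in token i's layer output, with each row summing to 1.
    The matrix-product rule is an assumption, not taken from the abstract.
    """
    n_tokens = contributions[0].shape[0]
    total = np.eye(n_tokens)
    for layer_matrix in contributions:
        # Left-multiply so each new layer mixes the attributions accumulated so far.
        total = layer_matrix @ total
    return total


# Toy two-token, two-layer example with made-up interaction scores.
layer_1 = normalize_rows([[1.0, 1.0], [0.0, 2.0]])
layer_2 = normalize_rows([[3.0, 0.0], [1.0, 1.0]])
attributions = aggregate_layerwise([layer_1, layer_2])
```

Under these assumptions, row i of `attributions` distributes the final representation of token i over the input tokens. Because a product of row-normalized matrices is itself row-normalized, each row still sums to 1.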