CL · Dec 22, 2024

Reversed Attention: On The Gradient Descent Of Attention Layers In GPT

arXiv:2412.17019v1 · 14 citations · h-index: 5 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses the problem of limited interpretability in attention mechanisms for researchers and practitioners, offering incremental insights into backpropagation dynamics.

The paper tackles the overlooked backward pass of attention in Transformer-based language models, revealing a 'Reversed Attention' matrix that helps explain model behavior and enables direct editing of attention without weight changes, demonstrated through a novel 'attention patching' method.

The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward pass of LMs, the backward pass of attention has been largely overlooked. In this work, we study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as "Reversed Attention". We examine the properties of Reversed Attention and demonstrate its ability to elucidate the models' behavior and edit dynamics. In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model's weights, using a novel method called "attention patching". In addition to enhancing the comprehension of how LMs configure their attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass.
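
To make the idea of an attention-shaped quantity in the backward pass concrete, here is a minimal NumPy sketch. It is not the paper's code. It assumes a single causal attention head with no batch dimension, and it treats the gradient of the loss with respect to the pre-softmax attention scores as the attention-shaped map, which is one plausible reading of the abstract. The function names, the random stand-in for the upstream gradient, and the final "patching" step with step size `eta` are all hypothetical illustrations.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax under a causal (lower-triangular) mask."""
    n = scores.shape[-1]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_forward(q, k, v):
    """Single-head causal attention; returns scores, attention, and output."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    attn = causal_softmax(scores)
    return scores, attn, attn @ v

def score_gradient(attn, v, grad_out):
    """Gradient of the loss w.r.t. the pre-softmax scores (an n x n map)."""
    grad_attn = grad_out @ v.T  # dL/dA, because out = A @ V
    # Row-wise softmax backward: A * (G - sum_j G_ij * A_ij)
    return attn * (grad_attn - (grad_attn * attn).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
scores, attn, out = attention_forward(q, k, v)

# Stand-in for the upstream gradient dL/d(out); in a real model it would
# come from backpropagating the loss through the later layers.
grad_out = rng.standard_normal((n, d))
grad_scores = score_gradient(attn, v, grad_out)

# Hypothetical "patching" step (an assumption, not the paper's procedure):
# nudge the scores along the negative gradient and recompute the attention,
# leaving q, k, v and every weight matrix untouched.
eta = 0.1
patched_attn = causal_softmax(scores - eta * grad_scores)
```

In this sketch `grad_scores` has the same shape as the forward attention matrix, is zero above the diagonal because of the causal mask, and has rows that sum to zero, which follows from the softmax Jacobian. The patched attention is still a valid row-stochastic matrix, so it could stand in for the forward attention without any weight update.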

Code Implementations: 1 repo
