CV · AI · Apr 4, 2024

Dissecting Query-Key Interaction in Vision Transformers

arXiv:2405.14880v4 · 10 citations · h-index: 4 · NIPS
Originality: Incremental advance
AI Analysis

The paper offers a novel perspective for interpreting attention mechanisms in vision transformers and helps explain how these models use context and features. The advance is incremental, since it builds on existing analysis methods.

The paper analyzes query-key interactions in vision transformers using singular value decomposition. It finds that in many ViTs early layers attend more to similar tokens (perceptual grouping), while late layers show increased attention to dissimilar tokens (contextualization). Many of the resulting interactions are interpretable and semantic, such as attention between parts of an object or between foreground and background.

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to analyze the query-key interaction by the singular value decomposition of the interaction matrix (i.e. ${\textbf{W}_q}^\top\textbf{W}_k$). We find that in many ViTs, especially those with classification training objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.
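The decomposition described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are assumed to have shape `(d_head, d_model)` so that a query is `W_q @ x`, random matrices stand in for trained ViT weights, and the function names `qk_interaction_svd` and `weighted_mode_similarity` are hypothetical. The singular-value-weighted cosine between paired left and right singular vectors is one plausible way to summarize whether a head favors similar or dissimilar tokens; the paper's exact metric may differ.

```python
import numpy as np


def qk_interaction_svd(W_q, W_k):
    """SVD of the query-key interaction matrix W_q^T W_k.

    Assumes W_q and W_k have shape (d_head, d_model), i.e. q = W_q @ x.
    Returns only the first d_head singular modes, since the product has
    rank at most d_head.
    """
    M = W_q.T @ W_k                      # (d_model, d_model)
    U, S, Vt = np.linalg.svd(M)
    r = W_q.shape[0]
    return U[:, :r], S[:r], Vt[:r, :]


def weighted_mode_similarity(U, S, Vt):
    """Singular-value-weighted cosine between paired singular vectors.

    Values near +1 suggest the head matches tokens with similar features;
    values near -1 or 0 suggest it pairs dissimilar features.
    (Illustrative summary statistic, an assumption of this sketch.)
    """
    # Singular vectors have unit norm, so the dot product is the cosine.
    cos = np.sum(U * Vt.T, axis=0)
    return float(np.sum(S * cos) / np.sum(S))


# Random stand-ins for one attention head's trained weights.
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))

U, S, Vt = qk_interaction_svd(W_q, W_k)

# The pre-softmax score between two tokens splits into per-mode terms:
# x_i^T W_q^T W_k x_j = sum_n s_n * (u_n . x_i) * (v_n . x_j)
x_i = rng.standard_normal(d_model)
x_j = rng.standard_normal(d_model)
score_direct = (W_q @ x_i) @ (W_k @ x_j)
score_modes = np.sum(S * (U.T @ x_i) * (Vt @ x_j))

similarity = weighted_mode_similarity(U, S, Vt)
```

In this view each singular mode pairs a query-side feature `u_n` with a key-side feature `v_n`. When `W_k` equals `W_q` the two features coincide and the head attends to similar tokens; the more the paired vectors diverge, the more the head attends across dissimilar features.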

Code Implementations: 1 repo
