Decomposing Query-Key Feature Interactions Using Contrastive Covariances
This work provides interpretability tools for Transformer models and addresses a key bottleneck in understanding attention mechanisms. The contribution is incremental, since it builds on existing analysis methods.
The paper addresses the question of why attention heads in Transformers attend to specific tokens. It decomposes the query-key space into low-rank, interpretable components using a contrastive covariance method, and it identifies human-interpretable subspaces for categorical semantic features and binding features in large language models.
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
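The idea of contrasting covariances can be illustrated with a toy sketch. The abstract does not spell out the construction, so everything here is an assumption: we take "contrastive covariance" to mean the query-key cross-covariance over matched pairs minus the cross-covariance over mismatched pairs, and the data, the feature layout, and the helper names (`cross_covariance`, `contrast`, `bilinear`) are hypothetical. In a realistic setting the contrast matrix would then be factored into low-rank components, for example with a singular value decomposition. The toy skips that step because its contrast matrix is already rank one.

```python
# Toy sketch under stated assumptions; not the paper's implementation.

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cross_covariance(pairs):
    """Mean outer product of centered queries and centered keys."""
    mq = mean([q for q, _ in pairs])
    mk = mean([k for _, k in pairs])
    cov = [[0.0] * len(mk) for _ in range(len(mq))]
    for q, k in pairs:
        for i in range(len(mq)):
            for j in range(len(mk)):
                cov[i][j] += (q[i] - mq[i]) * (k[j] - mk[j]) / len(pairs)
    return cov

def contrast(pos, neg):
    """Entrywise difference of two covariance matrices."""
    return [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(pos, neg)]

def bilinear(q, M, k):
    """Bilinear form q^T M k, the shape of an attention logit."""
    return sum(q[i] * M[i][j] * k[j]
               for i in range(len(q)) for j in range(len(k)))

# Hypothetical data: a binary feature s sits on query axis 0 and key axis 1.
# The remaining axes carry a constant that is unrelated to the feature.
matched = [([s, 1.0], [1.0, s]) for s in (1.0, -1.0)]
mismatched = [([sq, 1.0], [1.0, sk])
              for sq in (1.0, -1.0) for sk in (1.0, -1.0)]

# The contrast isolates the query-axis-0 / key-axis-1 interaction.
delta = contrast(cross_covariance(matched), cross_covariance(mismatched))

# Attribution: how much this component contributes to a given (q, k) pair.
aligned = bilinear([1.0, 1.0], delta, [1.0, 1.0])
opposed = bilinear([1.0, 1.0], delta, [1.0, -1.0])
```

In this toy, `delta` has a single nonzero entry linking query axis 0 to key axis 1, so the component contributes positively when the query and key features agree and negatively when they disagree. This mirrors, in miniature, the claim that aligned features in a low-rank subspace produce high attention scores.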