An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
This work addresses a challenge researchers face when interpreting transformer models. It is incremental in that it builds on prior work on memory management.
The study identified specific attention heads that erase earlier outputs in a 4-layer transformer, providing evidence for memory management, and showed that direct logit attribution can be misleading due to this erasure.
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can give misleading results when it does not account for erasure.
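To make the failure mode concrete, here is a minimal toy sketch in plain Python. It is not the paper's experimental setup: the 3-dimensional residual stream, the vectors, and the names (`writer_out`, `eraser_out`, `unembed_dir`) are all hypothetical, the erasure is assumed to be exact, and layer norm is ignored. It only illustrates why per-component DLA scores can be large even when the component's net effect on the logit is zero.

```python
# Toy illustration (hypothetical values): DLA on a writer head and an eraser head.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def scale(c, a):
    """Multiply every entry of vector a by scalar c."""
    return [c * x for x in a]

# Unembedding direction for a single logit (assumption: no final layer norm).
unembed_dir = [1.0, 0.5, 0.0]

# Output an early "writing" head adds to the residual stream.
writer_out = [2.0, 0.0, 1.0]

# A later "erasing" head removes that output (assumption: exact erasure).
eraser_out = scale(-1.0, writer_out)

# Some unrelated component that also writes to the residual stream.
other_out = [0.0, 1.0, 0.0]

# The final residual stream is the sum of all component outputs.
final_resid = add(add(writer_out, eraser_out), other_out)

# DLA: project each component's output onto the unembedding direction.
dla_writer = dot(writer_out, unembed_dir)
dla_eraser = dot(eraser_out, unembed_dir)
dla_other = dot(other_out, unembed_dir)

# The actual logit comes from the final residual stream.
logit = dot(final_resid, unembed_dir)

print("DLA writer:", dla_writer)
print("DLA eraser:", dla_eraser)
print("DLA other: ", dla_other)
print("logit:     ", logit)
```

In this toy case the writer and eraser receive equal and opposite DLA scores, so read in isolation each looks important to the prediction, yet together they contribute nothing to the logit. The individual scores still sum to the true logit; what misleads is interpreting any one of them as that component's effect.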