Attention Projection Mixing and Exogenous Anchors
For machine learning researchers, this work addresses an architectural tension in Transformers that reuse early-layer attention projections, and it reports incremental gains in data efficiency and downstream accuracy.
The paper tackles a tension in Transformers that reuse early-layer attention projections: those projections must act both as a stable reference for deeper layers and as a computational block. It proposes ExoFormer, which decouples the two roles with exogenous anchor projections. The dynamic variant gains 2.13 points in downstream accuracy over the baseline and matches baseline validation loss with 1.84x fewer tokens.
Transformers that reuse early-layer attention projections as residuals face a fundamental tension: the first layer must simultaneously serve as a stable reference for all deeper layers and as an effective computational block. To resolve this, we propose ExoFormer, which learns dedicated exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. Through a unified normalized mixing framework (studying different coefficient granularities: elementwise, headwise, scalar) across all attention pathways (queries, keys, values, and gate logits), ExoFormer variants consistently outperform their internal-anchor counterparts. Moreover, the dynamic variant achieves a 2.13-point increase in downstream accuracy over the baseline and demonstrates superior data efficiency, matching baseline validation loss with 1.84x fewer tokens. ExoFormer also achieves a 2x reduction in attention sink compared to standard Gated Attention. Paradoxically, all ExoFormer variants exhibit signs of representation collapse. We explain this via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in computational refinement. We release code and models to facilitate future research.
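To make the three coefficient granularities concrete, the following is a minimal sketch of mixing one token's layer projection with an anchor projection. It is not the paper's implementation: the abstract does not give the mixing formula, so the sketch assumes that "normalized" means the two mixing weights are positive and sum to 1 (here via a softmax over a pair of logits). The names `mixing_weights`, `mix_projection`, `layer_proj`, and `anchor_proj` are hypothetical, and the same routine would be applied separately to queries, keys, values, and gate logits.

```python
import math


def mixing_weights(logit_layer, logit_anchor):
    """Softmax over two logits: both weights are positive and sum to 1.

    ASSUMPTION: this is one plausible reading of "normalized mixing";
    the paper may normalize differently.
    """
    m = max(logit_layer, logit_anchor)  # subtract the max for numerical stability
    e_layer = math.exp(logit_layer - m)
    e_anchor = math.exp(logit_anchor - m)
    total = e_layer + e_anchor
    return e_layer / total, e_anchor / total


def mix_projection(layer_proj, anchor_proj, logits, granularity):
    """Mix a layer's projection with the anchor projection for one token.

    layer_proj, anchor_proj: nested lists of shape [n_heads][head_dim].
    logits: (layer, anchor) logit pairs whose layout depends on granularity:
      "scalar":      a single pair shared by every head and dimension
      "headwise":    a list of n_heads pairs, one per head
      "elementwise": pairs of shape [n_heads][head_dim], one per element
    """
    mixed = []
    for h, (layer_head, anchor_head) in enumerate(zip(layer_proj, anchor_proj)):
        row = []
        for d, (x, a) in enumerate(zip(layer_head, anchor_head)):
            if granularity == "scalar":
                pair = logits
            elif granularity == "headwise":
                pair = logits[h]
            elif granularity == "elementwise":
                pair = logits[h][d]
            else:
                raise ValueError(f"unknown granularity: {granularity}")
            w_layer, w_anchor = mixing_weights(*pair)
            row.append(w_layer * x + w_anchor * a)
        mixed.append(row)
    return mixed


# Toy example: 2 heads, head_dim 2, with an all-zero anchor for readability.
layer_proj = [[1.0, 2.0], [3.0, 4.0]]
anchor_proj = [[0.0, 0.0], [0.0, 0.0]]

# Scalar: equal logits give equal weights, so every element is halved.
scalar_mix = mix_projection(layer_proj, anchor_proj, (0.0, 0.0), "scalar")

# Headwise: head 0 mixes equally, head 1 relies almost entirely on the layer.
headwise_mix = mix_projection(
    layer_proj, anchor_proj, [(0.0, 0.0), (100.0, -100.0)], "headwise"
)
```

In this toy setup the granularity only changes how many logit pairs are learned: one in total, one per head, or one per element. In an ExoFormer-style model the anchor would come from dedicated projections outside the layer stack rather than from the first layer, which is the distinction the abstract draws between exogenous and internal anchors.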