Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
This work addresses a potential inefficiency in transformer architectures for NLP and AI, though the contribution appears incremental because it builds on existing algebraic insights.
The paper replaces the linear query projection in transformers with a nonlinear residual form. In experiments on GPT-3 small style models, it reports consistent improvement over the baseline and outperforms a model with 12.5% more non-embedding parameters.
Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline, comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
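The residual form $Q(X) = X + f_\theta(X)$ can be illustrated with a minimal sketch in plain Python. The abstract fixes only the overall form and the $d^2 + O(d)$ parameter budget. The bottleneck width $r = d/2$, the GELU activation, the zero-initialized up-projection, and all function and parameter names below are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def init_query_mlp(d, r, seed=0):
    # Hypothetical parameter layout for the bottleneck MLP f_theta: d -> r -> d
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    return {
        "W_down": [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(r)],  # r x d
        "b_down": [0.0] * r,
        # Zero init (an assumption) makes f_theta(x) = 0, so Q(x) = x at the start
        "W_up": [[0.0] * r for _ in range(d)],  # d x r
        "b_up": [0.0] * d,
    }


def nonlinear_query(x, params):
    # Q(x) = x + f_theta(x) for a single token vector x of length d
    h = [gelu(a + b) for a, b in zip(matvec(params["W_down"], x), params["b_down"])]
    f = [a + b for a, b in zip(matvec(params["W_up"], h), params["b_up"])]
    return [xi + fi for xi, fi in zip(x, f)]


def count_params(params):
    # Count scalar entries across matrices (lists of rows) and bias vectors
    total = 0
    for v in params.values():
        if v and isinstance(v[0], list):
            total += sum(len(row) for row in v)
        else:
            total += len(v)
    return total


d = 8
r = d // 2  # assumed bottleneck width; gives 2 * d * r = d^2 weights
params = init_query_mlp(d, r)
x = [float(i) for i in range(d)]
q = nonlinear_query(x, params)
n_params = count_params(params)  # d^2 weights plus r + d biases
```

With $r = d/2$, the two weight matrices together hold $2 \cdot d \cdot (d/2) = d^2$ entries and the biases add $3d/2$, which matches the stated $d^2 + O(d)$ budget. Under the assumed zero initialization, the query at the start of training equals the input, which is one way to realize the identity prior the abstract describes.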