CL · LG · Apr 12

Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

arXiv:2604.10791


AI Analysis

For transformer-based language models, this work offers a simple, cache-friendly way to improve attention by incorporating position-agnostic features and content bypass, with consistent benefits across model sizes.

The paper introduces two modifications to transformer attention: a non-linear pre-projection MLP placed before the Q/K/V projections, and a content skip connection that routes the pre-projection's features around attention. On Pythia-160M, the combined method achieves +40.6% LAMBADA accuracy and a 39% reduction in perplexity.

We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection's features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.
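
The two modifications described above can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation; the abstract does not give the exact formulation, so several details here are assumptions: the MLP shape and GELU activation, the use of rotary position encoding on Q and K, the scalar `gate` as the learned skip weight, and all names (`pre_projection`, `attention_block`, `init_params`). The sketch only aims to show where the pre-projection sits (after layer norm, before Q/K/V and before any positional encoding) and how its features can be added back around attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def rope(x):
    # Rotary position encoding over (first half, second half) pairs; d must be even.
    T, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    ang = np.arange(T)[:, None] * freqs[None, :]
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang), x1 * np.sin(ang) + x2 * np.cos(ang)],
        axis=-1,
    )

def init_params(d, hidden, seed=0):
    # Hypothetical parameter layout, for illustration only.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, shape)
    return {
        "W1": w(d, hidden), "W2": w(hidden, d),   # pre-projection MLP
        "Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d), "Wo": w(d, d),
        "gate": 0.0,                               # assumed scalar skip weight
    }

def pre_projection(x, p):
    # Position-agnostic: each token is transformed independently,
    # and no positional encoding has been applied yet.
    return gelu(layer_norm(x) @ p["W1"]) @ p["W2"]

def attention_block(x, p):
    T, d = x.shape
    f = pre_projection(x, p)
    q = rope(f @ p["Wq"])            # position enters only here
    k = rope(f @ p["Wk"])
    v = f @ p["Wv"]
    scores = q @ k.T / np.sqrt(d)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)
    scores = scores - scores.max(-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(-1, keepdims=True)
    attn_out = (weights @ v) @ p["Wo"]
    # Content skip: pre-projection features bypass attention, scaled by the gate.
    y = x + attn_out + p["gate"] * f
    return y, (k, v)                 # cache holds K and V only, shape (T, d) each

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
params = init_params(d=8, hidden=16)
y, (k_cache, v_cache) = attention_block(x, params)
```

In this sketch the cached tensors are still just K and V with their usual shapes, which is consistent with the claim that the modifications add no K/V cache overhead; the skip path is computed per token and needs nothing from earlier positions.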
