Compressible Softmax-Attended Language under Incompressible Attention
The result identifies a compressibility property of language data as seen through transformer attention, which could inform more efficient model designs. The contribution is incremental in that it builds on existing attention analysis.
In transformer language models, the logit energy field reaches 90% of its variance in only 2-11 singular components, while the learned interaction matrix requires 38-75 components for the same threshold. This is a 5-25x gap in effective rank: the interactions that language actually induces are highly compressible, even though the attention mechanism allocates its capacity uniformly across dimensions.
Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance in 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components, out of $d_h \in \{64, 128\}$, to reach the same threshold. The spectral gap is $5$--$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
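
The comparison above rests on one measurement: how many leading singular components a matrix needs before its cumulative variance passes 90%. A minimal sketch of that count follows, assuming "variance" means the sum of squared singular values (the text does not spell out the normalization). The function name `components_for_variance` is hypothetical, and both matrices are synthetic stand-ins rather than the paper's $\tilde{E}$ or $W_Q^\mathrm{T} W_K$, so the sketch illustrates the procedure only and does not reproduce the reported numbers.

```python
import numpy as np


def components_for_variance(matrix, threshold=0.90):
    """Count the leading singular components needed to reach `threshold`
    of the total variance.

    Assumption: variance is the sum of squared singular values.
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # descending order
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share reaches the threshold, as a count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(singular_values))


rng = np.random.default_rng(0)
d_h = 64  # head dimension; 64 is one of the two values quoted in the text

# Synthetic stand-in for a concentrated field: rank 3 plus faint noise.
low_rank = rng.normal(size=(d_h, 3)) @ rng.normal(size=(3, d_h))
low_rank += 1e-3 * rng.normal(size=(d_h, d_h))

# Synthetic stand-in for a matrix that spreads variance across dimensions.
dense = rng.normal(size=(d_h, d_h))

k_low = components_for_variance(low_rank)
k_dense = components_for_variance(dense)
print(f"low-rank stand-in: {k_low} components; dense stand-in: {k_dense} components")
print(f"ratio: {k_dense / k_low:.1f}x")
```

The ratio of the two counts is the kind of quantity the text reports as a $5$--$25\times$ gap. The synthetic values printed here say nothing about real models.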