Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

arXiv:2603.13085 · 9.8 · h-index: 4
AI Analysis

This work addresses a fundamental theoretical question about attention mechanisms that is relevant to machine learning researchers. It highlights a trade-off between performance and robustness. The contribution is incremental, but it clarifies prior empirical observations.

The paper shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. The cause is a spectral amplification effect that pushes the width required for convergence beyond practical values. The paper also shows that attention has 6–9× higher influence malleability than ReLU networks, which both reduces approximation error and increases vulnerability to adversarial manipulation.

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6–9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.
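
The spectral amplification claim can be checked numerically on toy data. The sketch below is not the paper's construction: it assumes, purely for illustration, that linearized attention maps data $X$ to features $H = (XX^\top)X$, so that the induced Gram matrix is $HH^\top = G^3$ with $G = XX^\top$. Under that assumption the condition number is cubed, and $\kappa^6$ is the square of the cubed value. The function name `gram_condition_numbers`, the toy shapes, and the random data are hypothetical.

```python
import numpy as np

def gram_condition_numbers(X):
    """Return (kappa, kappa_attn) for data X of shape (n, d).

    Assumption (not from the source): linearized attention acts as
    H = (X X^T) X, so its induced Gram matrix is H H^T = G^3, G = X X^T.
    """
    G = X @ X.T          # data Gram matrix, shape (n, n)
    H = G @ X            # assumed linearized-attention features
    G_attn = H @ H.T     # equals G @ G @ G up to rounding
    return np.linalg.cond(G), np.linalg.cond(G_attn)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))   # hypothetical toy data: n=5, d=20
kappa, kappa_attn = gram_condition_numbers(X)
width_scale = kappa ** 6           # scale of the stated Omega(kappa^6) threshold

print(f"kappa(G)      = {kappa:.3g}")
print(f"kappa(G_attn) = {kappa_attn:.3g}  (kappa^3 = {kappa ** 3:.3g})")
print(f"kappa^6       = {width_scale:.3g}")
```

Even this well-conditioned toy Gram matrix gives a $\kappa^6$ that is orders of magnitude larger than $\kappa$. Gram matrices of natural image data are typically far more ill-conditioned, which is consistent with the abstract's statement that the threshold exceeds any practical width.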
