82.6NAApr 2
Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative FormulationsMichel Fabrice Serret, Alice Cortinovis, Yijun Dong et al.
The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work aimed at accelerating attention through approximation and reformulation. In this survey, we revisit attention mechanisms through the lens of numerical analysis, with a particular emphasis on tools and perspectives from numerical linear algebra. Our goal is twofold: first, we aim to systematically review and classify fast approximation methods according to the numerical principles they exploit. These include sparsity and clustering approaches, low-rank and subspace projection techniques, randomized sketching methods, and tensor-based decompositions. We also discuss kernel-inspired reformulations of attention and recent architectural variants, such as Latent Attention, that modify the standard softmax formulation to improve efficiency. Second, by presenting these developments within a unified mathematical framework, we aim to bridge the gap between disciplines and highlight opportunities for further contributions from computational mathematics, particularly numerical linear algebra, to the design of scalable attention mechanisms.
10.4NAMay 8
Kernel-based linear system identification using augmented Krylov subspacesFabio Matti, Martin Skovgaard Andersen, Tianshi Chen et al.
We propose a novel Krylov subspace method for estimating the finite impulse response (FIR) of a one-dimensional linear time-invariant systems. The method approximates the system's FIR using a kernel-based formulation combined with hyperparameter selection based on maximum likelihood estimation (MLE), which requires repeated evaluation of two terms: The data fit $\boldsymbol{y}^{\top} (λ\boldsymbol{I} + \boldsymbol{A})^{-1} \boldsymbol{y}$ and the model complexity $\log(\det (λ\boldsymbol{I} + \boldsymbol{A}))$, where $\boldsymbol{A}$ is a certain positive semidefinite matrix that admits fast matrix--vector products and $λ> 0$ is a regularization parameter. Instead of approximating these two quantities separately, we jointly approximate them using a single augmented Krylov subspace for $\boldsymbol{A}$. One major benefit of augmentation is that we obtain accelerated convergence when approximating the data fit quadratic form, through implicit preconditioning. Thanks to the shift invariance of Krylov subspaces, the extracted approximations can be used to evaluate the MLE objective for many values of $λ$ at little additional cost. We derive error bounds for the approximations, reflecting the benefits of augmentation demonstrated through multiple numerical experiments.