LG | AI | Feb 21

Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training

arXiv:2602.18851v1
Originality: Incremental advance
AI Analysis

The paper addresses overflow in low-precision training of large transformer models. Its solution is principled, if incremental: it improves stability while keeping downstream accuracy comparable.

The paper tackles overflow risk in low-precision transformer training by deriving a rank-aware concentration inequality for attention scores, which gives 8-28x tighter concentration than rank-agnostic bounds. It then uses this result to derive geometry-aware scale factors for FP8 training. Across models from GPT-2 XL to Llama-2-70B, these scale factors eliminate overflows in transient scenarios where delayed scaling fails, while maintaining comparable accuracy.

Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.
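
The spectral-norm step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It estimates $\|W^Q W^{K\top}\|_2$ by power iteration without ever forming the $d \times d$ product, then turns the elementary worst-case bound $|S_{ij}| \le \|x_i\|\,\|M\|_2\,\|x_j\| / \sqrt{d_h}$ into an FP8 scale. The function names, the `x_norm_bound` parameter (an assumed cap on token norms), and the choice of the E4M3 maximum of 448 as the target range are assumptions made for illustration. The paper's rank-aware bound is tighter than the worst-case bound used here.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def spectral_norm_qk(w_q, w_k, n_iters=200, seed=0):
    """Estimate ||W_Q W_K^T||_2 by power iteration on M^T M,
    where M = W_Q W_K^T is applied implicitly (never materialized).

    w_q, w_k: arrays of shape (d, d_h).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w_k.shape[0])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(n_iters):
        mv = w_q @ (w_k.T @ v)       # M v, costs O(d * d_h)
        sigma = np.linalg.norm(mv)   # ||M v|| with ||v|| = 1 approaches sigma_max
        u = w_k @ (w_q.T @ mv)       # M^T (M v)
        norm_u = np.linalg.norm(u)
        if norm_u == 0.0:            # M is (numerically) zero
            return 0.0
        v = u / norm_u
    return sigma


def fp8_logit_scale(w_q, w_k, x_norm_bound, fmt_max=FP8_E4M3_MAX):
    """Return (scale, logit_bound) from weights alone, no activations observed.

    logit_bound is the worst-case bound x_norm_bound^2 * sigma / sqrt(d_h);
    x_norm_bound is an assumed upper limit on ||x_i|| (hypothetical parameter).
    """
    d_h = w_q.shape[1]
    sigma = spectral_norm_qk(w_q, w_k)
    logit_bound = (x_norm_bound ** 2) * sigma / np.sqrt(d_h)
    return fmt_max / logit_bound, logit_bound


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_h = 256, 32
    w_q = rng.standard_normal((d, d_h)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_h)) / np.sqrt(d)
    scale, bound = fp8_logit_scale(w_q, w_k, x_norm_bound=np.sqrt(d))
    print(f"logit bound: {bound:.3f}, FP8 scale: {scale:.3f}")
```

Because power iteration approaches the largest singular value from below, a practical version would likely multiply the estimate by a small safety margin before deriving the scale.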

