Attention Sinks and Outliers in Attention Residuals
For practitioners deploying transformer models with AttnResidual layers, OASIS improves inference stability and quantization robustness.
OASIS reduces attention sinks and activation outliers in AttnResidual architectures via inter-layer null signaling, achieving 9.26% lower max infinity norm, 2.60% lower kurtosis, 75.85% perplexity reduction under W8A8, and 12.42% GSM8K improvement under W4A4.
We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. Because AttnResidual architectures introduce an additional depth-wise normalization channel, they gain inter-layer routing flexibility but also exacerbate attention sinks and activation outliers, which in turn degrade inference stability and quantization robustness. OASIS addresses this by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention-sink mitigation and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
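To make the "null space" idea and the two outlier metrics concrete, the following is a minimal sketch, not the OASIS implementation. It assumes the commonly used definition of Softmax1, softmax1(s)_i = exp(s_i) / (1 + sum_j exp(s_j)), which the abstract does not spell out. It also assumes that "kurtosis" means the standard fourth standardized moment. The function names (`softmax1`, `null_mass`, `inf_norm`, `kurtosis`) are hypothetical, and the inter-layer null signal that couples this to depth routing is not modeled here.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(scores):
    """Assumed Softmax1: exp(s_i) / (1 + sum_j exp(s_j)).

    The extra 1 in the denominator acts like an implicit zero-score
    "null" slot, so the weights may sum to less than 1.
    """
    # Shift by max(max(scores), 0) so the implicit zero logit is covered too.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


def null_mass(scores):
    """Probability mass assigned to the implicit null slot under Softmax1."""
    return 1.0 - sum(softmax1(scores))


def inf_norm(values):
    """Infinity norm: the largest absolute activation."""
    return max(abs(v) for v in values)


def kurtosis(values):
    """Fourth standardized moment (non-excess), computed over the population."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / var ** 2


if __name__ == "__main__":
    uninformative = [-4.0, -4.0, -4.0]  # no key is a good match
    print("softmax sum :", sum(softmax(uninformative)))
    print("softmax1 sum:", sum(softmax1(uninformative)))
    print("null mass   :", null_mass(uninformative))
```

When no key matches well, standard softmax still has to distribute all of its mass over the keys, which is one commonly cited reason attention collapses onto a sink token. Under the assumed Softmax1, most of that mass can go to the null slot instead. The `inf_norm` and `kurtosis` helpers show the kind of statistics the reported reductions refer to; how the paper aggregates them across layers and settings is not stated in this section.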