LG | Apr 29, 2025

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

arXiv:2504.20966v2 | 18 citations | h-index: 36 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in transformer models, offering potential benefits for quantization, low-precision training, and sparsity optimization, but it is incremental as it modifies an existing component rather than introducing a new paradigm.

The paper tackles the issues of attention sink and massive activations in transformer attention mechanisms by introducing softpick, a rectified drop-in replacement for softmax, which achieves 0% sink rate and improves performance in quantized models, especially at lower bit precisions.

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention
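To make "rectified, not sum-to-one" concrete, here is a minimal pure-Python sketch that compares softmax with a softpick-style function on one row of attention scores. The abstract does not state the formula, so the form used here, ReLU(e^x - 1) divided by the sum of |e^x - 1| plus a small epsilon, is an assumption about the method and should be checked against the linked repository. The function names, the `eps` default, and the stabilizing shift are illustrative choices, not the authors' implementation.

```python
import math


def softmax(scores):
    """Standard softmax over one row of attention scores (for comparison)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softpick(scores, eps=1e-8):
    """Softpick-style normalization (assumed form, see lead-in).

    Computes relu(e^x - 1) / (sum_j |e^x_j - 1| + eps) in a shifted form.
    """
    # Shift by m >= 0 so no exponent is positive and nothing overflows.
    # Multiplying numerator and denominator by e^(-m) leaves the ratio
    # unchanged (apart from the tiny eps term).
    m = max(max(scores), 0.0)
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]
    numer = [max(s, 0.0) for s in shifted]       # rectification: negatives become 0
    denom = sum(abs(s) for s in shifted) + eps   # negatives still count here
    return [n / denom for n in numer]


if __name__ == "__main__":
    row = [2.0, 0.0, -1.0, 1.0]
    print("softmax :", [round(p, 4) for p in softmax(row)])
    print("softpick:", [round(p, 4) for p in softpick(row)])
    # A row with no positive score receives zero attention everywhere.
    print("softpick:", softpick([-3.0, -2.0, -5.0]))
```

Under this assumed form, scores at or below zero map to exactly zero, which is where the sparse attention maps come from, and the outputs sum to less than one whenever some score is negative. A row with no positive score gets zero weight on every position, so a head is not forced to place leftover probability mass on some token, which is the behavior usually associated with attention sinks under softmax.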

Code Implementations: 1 repo
