Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
This work addresses a specific bottleneck in transformer models, with potential benefits for quantization, low-precision training, and sparsity optimization. It is incremental, however: it modifies an existing component rather than introducing a new paradigm.
The paper tackles attention sink and massive activations in transformer attention by introducing softpick, a rectified drop-in replacement for softmax. Softpick achieves a 0% sink rate and improves the performance of quantized models, especially at lower bit precisions.
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention
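
To make "rectified, not sum-to-one" concrete, the following is a minimal NumPy sketch. The abstract does not state the formula, so the definition used here is an assumption: softpick(x)_i = ReLU(e^{x_i} - 1) / (sum_j |e^{x_j} - 1| + eps). The function name, the `eps` parameter, and the shift by `m` for numerical stability are illustrative choices, not taken from the authors' repository.

```python
import numpy as np

def softpick(x, axis=-1, eps=1e-6):
    """Sketch of a rectified softmax variant (assumed formula, see lead-in).

    Computes ReLU(e^x - 1) / (sum |e^x - 1| + eps) along `axis`.
    Numerator and denominator are both scaled by e^{-m}, which leaves the
    ratio unchanged (up to eps) while keeping the exponentials bounded.
    """
    x = np.asarray(x, dtype=np.float64)
    # Clamp the shift at 0 so that e^{-m} never exceeds 1.
    m = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    shifted = np.exp(x - m) - np.exp(-m)      # equals (e^x - 1) * e^{-m}
    numerator = np.maximum(shifted, 0.0)      # rectification: scores <= 0 get weight 0
    denominator = np.sum(np.abs(shifted), axis=axis, keepdims=True) + eps
    return numerator / denominator

scores = np.array([2.0, 0.0, -1.0, -3.0])
weights = softpick(scores)
print(weights)        # only the positive score receives nonzero weight
print(weights.sum())  # strictly less than 1
```

Under this assumed definition, non-positive scores map to exactly zero, which is where the sparse attention maps would come from, and a row of weights may sum to less than one, so a head is not forced to place leftover probability mass on a sink token.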