FlashNorm: fast normalization for LLMs
This work addresses a performance bottleneck in LLMs that matters to both researchers and practitioners, but it is incremental: it optimizes existing methods rather than introducing a new paradigm.
The paper tackles the computational inefficiency of normalization layers such as RMSNorm in large language models. It introduces FlashNorm, an exact but faster implementation that also reduces the number of parameter tensors by merging the normalization weights with the weights of the subsequent linear layer. The same approach speeds up RMSNorm, Layer Normalization, and Dynamic Tanh.
This paper presents FlashNorm, an exact but faster implementation of RMSNorm followed by linear layers. RMSNorm is used by many LLMs such as Llama, Mistral, and OpenELM. FlashNorm also speeds up Layer Normalization and its recently proposed replacement, Dynamic Tanh (DyT, arXiv:2503.10622). In addition, FlashNorm reduces the number of parameter tensors by merging the normalization weights with the weights of the next linear layer. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
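The weight merging described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the `eps` value, and the shape convention (`x` is `(batch, n)`, `g` is `(n,)`, `W` is `(n, m)`) are assumptions made for this example. The sketch relies on two facts: the elementwise RMSNorm weights `g` can be folded into the rows of the next linear layer's weight matrix, and the per-row scalar `1 / rms(x)` commutes with the matrix multiplication, so it may be applied after the matmul instead of before it.

```python
import numpy as np

def rmsnorm_linear_reference(x, g, W, eps=1e-6):
    """Standard RMSNorm with weights g, followed by a bias-free linear layer W."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return ((x / rms) * g) @ W

def merge_norm_weights(g, W):
    """Fold the normalization weights into the linear layer (done once, offline).

    Scaling input feature i by g[i] equals scaling row i of W by g[i],
    so the separate parameter tensor g is no longer needed at inference time.
    """
    return g[:, None] * W

def rmsnorm_linear_merged(x, W_merged, eps=1e-6):
    """Same output as the reference, using only the merged weights.

    The division by the RMS is a per-row scalar, so it is applied to the
    matmul result here (an assumption about how the deferral is arranged).
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x @ W_merged) / rms
```

To use the sketch, call `merge_norm_weights(g, W)` once, then pass the result to `rmsnorm_linear_merged`. The two paths agree up to floating-point rounding, which is the sense in which the merged form is "exact".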