LG | AI | OC | ML | Feb 26, 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

arXiv:2502.19002v2 | 21 citations | h-index: 10 | ICML
AI Analysis

This work addresses the efficiency of LLM training, offering a method to reduce computational costs, though it is incremental as it builds on existing optimizers like AdamW and Adam-mini.

The paper tackles the problem of accelerating large language model pre-training. It identifies a sharpness disparity across transformer blocks and proposes a blockwise learning rate strategy that achieves nearly 2x speedup and lower terminal loss than vanilla AdamW on models such as GPT-2 and LLaMA.

Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 2B and datasets of OpenWebText, MiniPile, and C4. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
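As a rough illustration of what a blockwise learning rate could look like in practice, the sketch below groups parameters by block type and scales a base LR per group. The name patterns in `classify_block`, the helper `build_blockwise_groups`, and the multiplier values are all hypothetical and chosen for illustration only; they are not the paper's settings, which the abstract does not specify.

```python
from collections import defaultdict

# Hypothetical LR multipliers per block type. These are placeholder values
# for illustration only and are NOT taken from the paper.
LR_MULTIPLIERS = {
    "embedding": 1.0,
    "norm": 1.0,
    "attention": 2.0,
    "ffn": 2.0,
}


def classify_block(param_name: str) -> str:
    """Map a parameter name to a block type using assumed naming patterns."""
    name = param_name.lower()
    if "embed" in name or "wte" in name or "wpe" in name:
        return "embedding"
    if "norm" in name or "ln_" in name:
        return "norm"
    if "attn" in name or "attention" in name:
        return "attention"
    if "mlp" in name or "ffn" in name:
        return "ffn"
    return "other"


def build_blockwise_groups(param_names, base_lr):
    """Group parameter names by block type and attach a scaled LR to each group."""
    grouped = defaultdict(list)
    for name in param_names:
        grouped[classify_block(name)].append(name)
    return [
        {
            "block": block,
            "params": names,
            # Unknown block types fall back to the base LR (multiplier 1.0).
            "lr": base_lr * LR_MULTIPLIERS.get(block, 1.0),
        }
        for block, names in grouped.items()
    ]


if __name__ == "__main__":
    # GPT-2 style parameter names, used here only as an example.
    names = [
        "transformer.wte.weight",
        "transformer.h.0.ln_1.weight",
        "transformer.h.0.attn.c_attn.weight",
        "transformer.h.0.mlp.c_fc.weight",
    ]
    for group in build_blockwise_groups(names, base_lr=1e-3):
        print(group["block"], group["lr"], group["params"])
```

The returned list has the same general shape as per-parameter-group options accepted by optimizers such as `torch.optim.AdamW` (a list of dicts with `params` and `lr`). In a real training script one would put the parameter tensors rather than their names under `params`, and would pick the multipliers from measured per-block sharpness instead of the placeholders used here.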

