LGCLJun 27, 2025

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

arXiv:2506.22049v22 citationsh-index: 11Has Code
Originality Incremental advance
AI Analysis

This addresses a stability issue in large language model pretraining, offering a versatile method to improve training dynamics, though it appears incremental as it builds on existing approaches.

The paper tackles the problem of exponential activation variance growth in Pre-LayerNorm Transformers, which limits learning in deeper layers, by proposing Gradient-Preserving Activation Scaling (GPAS) to scale down activations while preserving gradients, achieving consistent performance gains across model sizes from 71M to 1B.

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes