LG AI · May 14

GQA-μP: The maximal parameterization update for grouped query attention

Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma, Daria Soboleva, Joel Hestness, Zhengzhong Liu, Eric Xing

arXiv:2605.15290

AI Analysis

This work enables hyperparameter transfer for GQA architectures, reducing tuning costs for LLMs, but is incremental as it extends existing μP theory to a specific attention variant.

The paper derives maximal update parameterization (μP) scalings for grouped-query attention (GQA) by promoting spectral norm conditions on the weights to the definition of feature learning and by using a modified spectral norm for weight matrices that are not full rank. Experiments demonstrate learning rate transfer across the GQA repetition hyperparameter and examine transfer over weight decay.

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables what is, to our knowledge, the first derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter and by reporting experiments on transfer over weight decay.
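
To make the rank issue concrete, here is a minimal sketch of why the ordinary spectral norm is awkward for GQA. It is not the paper's modified norm, which the abstract does not specify. It assumes the common GQA layout in which each key/value head is shared by `r` query heads, so the expanded projection is the original one with each head block repeated `r` times. All names (`W_kv`, `expand_kv`, `spectral_norm`, the sizes) are hypothetical and chosen for illustration. Because the expanded matrix satisfies W_exp^T W_exp = r · W^T W, its spectral norm is sqrt(r) times that of the unexpanded matrix, so a naive spectral condition would depend on the repetition factor.

```python
import math
import random


def matvec(W, x):
    """Compute W @ x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rmatvec(W, y):
    """Compute W^T @ y for a matrix stored as a list of rows."""
    n_in = len(W[0])
    return [sum(W[i][j] * y[i] for i in range(len(W))) for j in range(n_in)]


def spectral_norm(W, iters=100, seed=0):
    """Estimate the largest singular value by power iteration on W^T W."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in W[0]]
    for _ in range(iters):
        x = rmatvec(W, matvec(W, x))
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    y = matvec(W, x)
    return math.sqrt(sum(v * v for v in y))


def expand_kv(W, n_kv, d_head, r):
    """Repeat each KV head block r times (assumed GQA sharing layout)."""
    expanded = []
    for h in range(n_kv):
        block = W[h * d_head:(h + 1) * d_head]
        for _ in range(r):
            expanded.extend(list(row) for row in block)
    return expanded


# Hypothetical sizes, chosen only to keep the example small.
d_model, n_kv, d_head, r = 16, 2, 4, 4

rng = random.Random(1)
# Key (or value) projection with shape (n_kv * d_head, d_model).
W_kv = [[rng.gauss(0.0, 1.0 / math.sqrt(d_model)) for _ in range(d_model)]
        for _ in range(n_kv * d_head)]

# Expanded projection seen by the n_kv * r query heads.
W_expanded = expand_kv(W_kv, n_kv, d_head, r)

base = spectral_norm(W_kv)
repeated = spectral_norm(W_expanded)
ratio = repeated / base

print(f"rows: {len(W_kv)} -> {len(W_expanded)}")
print(f"spectral norm ratio: {ratio:.6f}  (sqrt(r) = {math.sqrt(r):.6f})")
```

The expanded matrix has `n_kv * r * d_head` rows but rank at most `n_kv * d_head`, which is the non-full-rank situation the abstract refers to.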
