LG · Jul 4, 2025

Decoupled Relative Learning Rate Schedules

arXiv:2507.03526v1 · 2 citations · h-index: 6
Originality: Incremental advance
AI Analysis

The method offers a practical, scalable way to reduce training time and compute for large-scale neural networks. The advance is incremental, since it builds on existing optimization techniques.

The paper tackles the problem of inefficient LLM training by introducing a method that adjusts learning rates across different Transformer components, achieving up to 23% faster training, especially in complex models like Mixture of Experts.

In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Our method, relative learning rate schedules (RLRS), accelerates training by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be tuned efficiently on smaller models and then reused on models up to $27\times$ larger. This simple and effective method substantially reduces training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
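To make the idea of component-wise learning rates concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The component names, the multiplier values, and the warmup-plus-cosine base schedule are all assumptions chosen for illustration, and the abstract does not specify any of them. The sketch shows only the simplest reading of the idea: one shared base schedule, scaled by a fixed relative multiplier per component.

```python
import math

# Hypothetical per-component multipliers (illustrative values, not from the paper).
RELATIVE_LR = {
    "embedding": 2.0,
    "attention": 1.0,
    "feed_forward": 0.5,
    "router": 0.25,
}


def base_lr(step, total_steps, peak_lr=1e-3, warmup_steps=100, final_fraction=0.1):
    """Shared base schedule (an assumption): linear warmup, then cosine decay
    down to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)


def component_lrs(step, total_steps, relative_lr=RELATIVE_LR, **schedule_kwargs):
    """Return one learning rate per component: base schedule times its multiplier."""
    shared = base_lr(step, total_steps, **schedule_kwargs)
    return {name: shared * multiplier for name, multiplier in relative_lr.items()}


if __name__ == "__main__":
    for step in (0, 100, 500, 1000):
        print(step, component_lrs(step, total_steps=1000))
```

In a real training loop, each entry of the returned dictionary would typically be assigned to the optimizer parameter group holding that component's weights. Because the multipliers are relative to the base rate rather than absolute values, the same dictionary could in principle be carried over to a larger model while only the base rate is retuned, which is one way to read the reuse claim in the abstract.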

