LG, Jul 4, 2025

Decoupled Relative Learning Rate Schedules

Jan Ludziejewski, Jan Małaśnicki, Maciej Pióro, Michał Krutul, Kamil Ciebiera, Maciej Stefaniak, Jakub Krajewski, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Sebastian Jaszczur

arXiv:2507.03526v1, 9.42 citations, h-index: 6

Originality: Incremental advance

AI Analysis

The method offers a practical and scalable way to reduce training time and computational cost for large-scale neural networks. The advance is incremental, since it builds on existing optimization techniques.

The paper tackles the problem of inefficient LLM training by introducing a method that adjusts learning rates across different Transformer components, achieving up to 23% faster training, especially in complex models like Mixture of Experts.

In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our relative learning rate schedules (RLRS) method accelerates the training process by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to $27\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
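To make the idea of component-wise relative learning rates concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the component names, the multiplier values, the warmup length, and the cosine-decay base schedule are all hypothetical choices made for illustration. The sketch only shows the general pattern of scaling one shared base schedule by a per-component factor.

```python
import math

def base_lr(step, total_steps, peak_lr=1e-3, warmup_steps=100, final_fraction=0.1):
    """Shared base schedule (assumed): linear warmup, then cosine decay
    from peak_lr down to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)

# Hypothetical relative multipliers, one per Transformer component.
# The values are placeholders, not results from the paper.
RELATIVE_LR = {
    "embedding": 2.0,
    "attention": 1.0,
    "feed_forward": 0.5,
    "router": 0.25,  # only relevant for Mixture of Experts models
}

def component_lrs(step, total_steps, relative=RELATIVE_LR, **schedule_kwargs):
    """Return the learning rate of each component at a given step:
    the shared base learning rate times that component's multiplier."""
    lr = base_lr(step, total_steps, **schedule_kwargs)
    return {name: lr * mult for name, mult in relative.items()}

if __name__ == "__main__":
    for step in (0, 100, 550, 1000):
        print(step, component_lrs(step, total_steps=1000))
```

Under these assumptions, reusing tuned hyperparameters on a larger model would amount to keeping the `RELATIVE_LR` table fixed while changing only the base schedule arguments (for example `peak_lr`). In a framework such as PyTorch, the per-component values would typically be applied through separate optimizer parameter groups.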
