MLLGFeb 24, 2025

Function-Space Learning Rates

arXiv:2502.17405v23 citationsh-index: 3ICML
Originality Incremental advance
AI Analysis

This addresses the challenge of hyperparameter tuning for neural networks, particularly across different model scales, though it appears incremental as it builds on existing optimization concepts.

The paper tackled the problem of analyzing and controlling neural network training dynamics in function space rather than parameter space, resulting in the development of FLeRM, a method for hyperparameter transfer across model scales that maintains consistent function-space updates with minimal computational overhead.

We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes