LG | Mar 19

$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

arXiv:2406.00153 | 87.610 citations | h-index: 18
AI Analysis

This paper addresses the compute efficiency and meta-generalization of learned optimizers for neural network training. It is an incremental improvement with specific gains.

The paper tackles the problem of learned optimizers struggling to generalize to unseen tasks, especially wider networks, by proposing a meta-training recipe based on Maximal Update Parametrization ($μ$P). The results show substantially improved meta-generalization to wider tasks, along with unexpected gains on deeper networks (5x the depth seen in meta-training) and longer training horizons (25x the meta-training horizon), compared to learned optimizers trained under standard parametrization.

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($μ$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $μ$-parameterized LOs ($μ$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that $μ$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ meta-training) and surprising generalization to much longer training horizons ($25\times$ meta-training) when compared to SP LOs.
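The width problem described in the abstract can be illustrated with a small numerical sketch. It is not the paper's derivation, which the abstract does not spell out. It only shows the general $μ$P intuition for elementwise optimizers such as Adam: if every entry of a hidden-layer update is O(1), the resulting shift in pre-activations grows with the layer's fan-in, while dividing the update by the fan-in keeps the shift roughly constant across widths. The function name `preactivation_shift` and the sign-based update are hypothetical stand-ins for a learned optimizer's output.

```python
import numpy as np

def preactivation_shift(width, scale_by_fan_in, seed=0):
    """Mean absolute change in h = W @ x after one elementwise update step.

    Hypothetical illustration: the sign of the gradient stands in for any
    optimizer (learned or hand-designed) whose per-parameter output is O(1).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)       # layer input with O(1) entries
    delta = rng.standard_normal(width)   # backpropagated signal at the layer output
    grad = np.outer(delta, x)            # dL/dW for the square hidden layer h = W @ x
    update = -np.sign(grad)              # elementwise update with O(1) entries
    if scale_by_fan_in:
        update = update / width          # muP-style 1/fan_in scaling (assumed here)
    return float(np.mean(np.abs(update @ x)))

if __name__ == "__main__":
    for width in (128, 1024):
        unscaled = preactivation_shift(width, scale_by_fan_in=False)
        scaled = preactivation_shift(width, scale_by_fan_in=True)
        print(f"width={width:5d}  unscaled shift={unscaled:9.2f}  scaled shift={scaled:.3f}")
```

In this toy setting the unscaled shift grows roughly linearly with width, which is one way to see why an optimizer tuned at a small width can behave poorly on wider networks, whereas the scaled variant stays near the same magnitude.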

Code Implementations: 1 repo
