LG · ML · Oct 14, 2024

A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

arXiv:2410.10417v1 · 3 citations · h-index: 77 · AAAI
Originality: Highly original
AI Analysis

This work addresses the instability and hyperparameter sensitivity of bi-level optimization methods in meta-learning, offering deep learning practitioners a more reliable alternative.

The paper tackles the challenge of bi-level optimization in meta-learning by reformulating it as a stochastic optimization problem, using Stochastic Gradient Langevin Dynamics to sample inner distributions and a recurrent algorithm for gradient estimation. The method achieves robust performance across diverse meta-learning tasks and scales to learning 87M hyperparameters in Vision Transformers.

We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as bi-level optimization (BLO) problems. We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample from the inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes uncertainty into account, which makes the method robust to suboptimal inner optimization or to multiple non-unique inner minima caused by overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
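The stochastic reformulation described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it covers only the sampling step and a Monte Carlo estimate of the expected outer loss, and leaves out the recurrent hypergradient computation. The quadratic inner loss, the outer target, the temperature, the step size and all function names are assumptions chosen for illustration.

```python
import math
import random


def sgld_samples(lam, data, n_steps=20000, burn_in=2000, step=0.01,
                 temperature=0.1, batch_size=2, seed=0):
    """Approximately sample theta from p(theta | lam) ∝ exp(-f(theta, lam) / T).

    Toy inner loss (assumed for illustration, not taken from the paper):
        f(theta, lam) = mean_i 0.5 * (theta - a_i)^2 + 0.5 * lam * theta^2
    where lam plays the role of a weight-decay hyperparameter.
    """
    rng = random.Random(seed)
    theta = 0.0
    # Langevin noise scale for step size `step` and temperature T.
    noise_scale = math.sqrt(2.0 * step * temperature)
    samples = []
    for t in range(n_steps):
        # Minibatch gradient of the inner loss (the "stochastic gradient" part).
        batch = rng.sample(data, batch_size)
        grad = sum(theta - a for a in batch) / batch_size + lam * theta
        # SGLD update: gradient step plus injected Gaussian noise.
        theta = theta - step * grad + noise_scale * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(theta)
    return samples


def outer_loss(theta, target=1.0):
    """Toy outer (validation) loss; `target` is a hypothetical value."""
    return 0.5 * (theta - target) ** 2


def expected_outer_loss(lam, data, **kwargs):
    """Monte Carlo estimate of E_{theta ~ p(theta | lam)}[outer_loss(theta)]."""
    samples = sgld_samples(lam, data, **kwargs)
    return sum(outer_loss(th) for th in samples) / len(samples)


inner_data = [1.5, 2.0, 2.5]  # hypothetical inner-problem data
estimate = expected_outer_loss(lam=1.0, data=inner_data)
print(f"MC estimate of expected outer loss: {estimate:.4f}")
```

For this toy setup the inner distribution is Gaussian with mean 2 / (1 + lam) and variance T / (1 + lam), so at lam = 1 and T = 0.1 the exact expected outer loss is 0.025. The printed estimate should land near that value, up to Monte Carlo error and a small bias from the step size and minibatch noise.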
