LG OC · Jan 17, 2024

Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis

arXiv:2401.09587v117.015 citations · h-index: 5 · Has Code · ICLR

Originality: Highly original

AI Analysis

This work addresses a bottleneck in bilevel optimization: conventional methods assume that the upper-level gradient is Lipschitz, an assumption that fails for architectures with unbounded smoothness such as RNNs and LSTMs.

The paper tackles bilevel optimization for neural networks with unbounded smoothness (such as RNNs and LSTMs) by proposing the BO-REP algorithm, which updates the upper-level variable with normalized momentum and the lower-level variable only periodically. It achieves Õ(1/ε⁴) iteration complexity for finding an ε-stationary point, matching state-of-the-art results for the bounded smoothness setting up to logarithmic factors.
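For orientation, the standard bilevel setup that this summary refers to can be written as follows. The notation ($f$, $g$, $\Phi$, $y^*$) is assumed here for illustration and may differ from the paper's; the hypergradient identity is the usual one obtained from the implicit function theorem when $g$ is strongly convex in $y$.

$$
\min_{x \in \mathbb{R}^{d_x}} \Phi(x) := f\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) = \arg\min_{y \in \mathbb{R}^{d_y}} g(x, y)
$$

$$
\nabla \Phi(x) = \nabla_x f\bigl(x, y^*(x)\bigr)
 - \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)
   \bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
   \nabla_y f\bigl(x, y^*(x)\bigr)
$$

Under this notation, an ε-stationary point is a point $x$ with $\|\nabla \Phi(x)\| \le \epsilon$ (in expectation, for stochastic algorithms). The inverse Hessian is typically never formed explicitly; it is approximated through Hessian-vector products, which is why the complexity is counted in stochastic gradient and Hessian-vector product oracle calls.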

Bilevel optimization is an important formulation for many machine learning problems. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz. However, recent studies reveal that certain neural networks, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), potentially exhibit unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: initialization refinement and periodic updates. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is then updated only once per fixed period rather than at every iteration. When the upper-level problem is nonconvex with unbounded smoothness and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting and without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm.
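The three ingredients named in the abstract (normalized momentum for the upper level, initialization refinement, and periodic lower-level updates) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy problem, the function names `bo_rep_sketch` and `upper_objective`, and all hyperparameter values are assumptions chosen for illustration. The lower-level objective is assumed to be g(x, y) = ½‖y − Ax‖² and the upper-level objective f(x, y) = ½‖y − b‖², so the hypergradient has the closed form Aᵀ(y − b) and no Hessian-vector product estimator is needed. Plain noisy gradient steps stand in for whatever lower-level subroutine the paper actually uses.

```python
import numpy as np


def upper_objective(A, b, x):
    """Phi(x) = f(x, y*(x)) for the toy problem, where y*(x) = A x."""
    return 0.5 * np.sum((A @ x - b) ** 2)


def bo_rep_sketch(A, b, x0, steps=2000, eta=0.005, beta=0.9, period=4,
                  inner_steps=10, inner_lr=0.5, refine_steps=50,
                  noise=0.01, seed=0):
    """Illustrative loop: normalized momentum on x, periodic updates on y.

    Toy assumptions (not from the paper):
      lower level  g(x, y) = 0.5 * ||y - A x||^2  (strongly convex in y)
      upper level  f(x, y) = 0.5 * ||y - b||^2
    """
    rng = np.random.default_rng(seed)

    def grad_y_g(x, y):
        # Noisy lower-level gradient (stand-in for a stochastic oracle).
        return (y - A @ x) + noise * rng.standard_normal(y.shape)

    def hypergrad(x, y):
        # For this toy g, the yy-Hessian is I and the cross term is -A^T,
        # so the hypergradient estimate reduces to A^T (y - b), plus noise.
        return A.T @ (y - b) + noise * rng.standard_normal(x.shape)

    def lower_sgd(x, y, n):
        # n noisy gradient steps on g(x, .) with x held fixed.
        for _ in range(n):
            y = y - inner_lr * grad_y_g(x, y)
        return y

    x = x0.copy()
    # Initialization refinement: get y close to y*(x0) before the main loop.
    y = lower_sgd(x, np.zeros(A.shape[0]), refine_steps)
    m = hypergrad(x, y)

    for t in range(steps):
        # Periodic update: refresh y only once every `period` iterations.
        if t % period == 0:
            y = lower_sgd(x, y, inner_steps)
        # Momentum average of hypergradient estimates.
        m = beta * m + (1.0 - beta) * hypergrad(x, y)
        # Normalized step: the move has length eta regardless of ||m||.
        x = x - eta * m / (np.linalg.norm(m) + 1e-12)
    return x, y


if __name__ == "__main__":
    A = np.array([[2.0, 0.0], [0.0, 1.0]])
    b = np.array([2.0, -1.0])
    x0 = np.array([2.0, 0.0])
    x, y = bo_rep_sketch(A, b, x0)
    print("initial objective:", upper_objective(A, b, x0))
    print("final objective:  ", upper_objective(A, b, x))
```

Because each upper-level step has fixed length `eta`, the variable x can drift by at most `period * eta` between lower-level refreshes. That bound keeps a stale y close to y*(x) in this sketch, and it is the intuition behind pairing normalization with periodic updates.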
