OC LG ML | Dec 11, 2024

Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime

arXiv:2412.08025v1 | 3.2 | h-index: 5

Originality: Incremental advance

AI Analysis

This work gives researchers in optimization and machine learning a more complete picture of the Edge of Stability (EoS). The advance is incremental: it extends prior findings to a specific model, the depth-2 diagonal linear network.

The paper challenges the belief that subquadratic loss functions are necessary for the Edge of Stability (EoS) phenomenon in gradient descent. It shows, both empirically and theoretically, that EoS can occur even with a quadratic loss under certain conditions, and that the gradient descent trajectory still converges non-asymptotically to a linear interpolator.

Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge this traditional idea by empirically demonstrating that Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smoothness constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding a linear interpolator $\beta \in \mathbb{R}^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits the parameterization $\beta = w^2_{+} - w^2_{-}$. Contrary to previous work suggesting that a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating that the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis thus sheds new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.
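The setup described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names `gd_diag_linear` and `sharpness`, the uniform initialization scale `alpha`, the $1/(2n)$ scaling of the quadratic loss, and the toy data are all assumptions made for the example. It runs plain GD on $\beta = w^2_{+} - w^2_{-}$ with a quadratic loss and reports the sharpness (largest Hessian eigenvalue) at the final iterate, which can be compared against $2/\eta$.

```python
import numpy as np


def sharpness(X, y, w_plus, w_minus):
    """Largest Hessian eigenvalue of the loss in the (w_plus, w_minus) parameters."""
    n = X.shape[0]
    r = X @ (w_plus**2 - w_minus**2) - y           # residuals
    g = X.T @ r / n                                # gradient w.r.t. beta
    # Jacobian of the predictions w.r.t. (w_plus, w_minus), shape (n, 2d).
    J = np.hstack([2.0 * X * w_plus, -2.0 * X * w_minus])
    # Gauss-Newton term plus the curvature of the squared parameterization.
    H = J.T @ J / n + np.diag(np.concatenate([2.0 * g, -2.0 * g]))
    return float(np.linalg.eigvalsh(H)[-1])        # eigenvalues are ascending


def gd_diag_linear(X, y, eta, alpha=0.1, steps=5000):
    """Gradient descent on a depth-2 diagonal linear network.

    Loss (assumed scaling): (1 / 2n) * ||X (w_plus^2 - w_minus^2) - y||^2.
    Both weight vectors start at the assumed value alpha, so beta starts at 0.
    """
    n, d = X.shape
    w_plus = np.full(d, alpha)
    w_minus = np.full(d, alpha)
    losses = np.empty(steps)
    for t in range(steps):
        r = X @ (w_plus**2 - w_minus**2) - y
        losses[t] = 0.5 * np.mean(r**2)
        g = X.T @ r / n
        # Chain rule: dL/dw_plus = 2 w_plus g, dL/dw_minus = -2 w_minus g.
        w_plus, w_minus = (w_plus - eta * 2.0 * w_plus * g,
                           w_minus + eta * 2.0 * w_minus * g)
    beta = w_plus**2 - w_minus**2
    return beta, losses, sharpness(X, y, w_plus, w_minus)


if __name__ == "__main__":
    # Toy underdetermined problem: one sample, two features (illustrative only).
    X = np.array([[1.0, 2.0]])
    y = np.array([1.0])
    eta = 0.01
    beta, losses, sharp = gd_diag_linear(X, y, eta=eta)
    print("beta:", beta)
    print("final loss:", losses[-1])
    print("final sharpness:", sharp, "vs 2/eta:", 2.0 / eta)
```

With the small step-size used here, the run stays in the classical stable regime. Rerunning with $\eta$ larger than 2 divided by the reported sharpness is one way to probe the large step-size regime; the sketch makes no claim about the specific conditions under which the paper establishes EoS.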
