ML LG

Feb 6, 2025

Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints

Kyle Sung, Anastasis Kratsios, Noah Forman

ETH Zurich

arXiv:2502.03792v1

h-index: 7

Originality: Incremental advance

AI Analysis

The paper addresses the problem of understanding and controlling Lipschitz regularity in neural network training, which is of interest to researchers in optimization and generalization theory. The advance is incremental, as it builds on existing ERM and GD frameworks.

The paper shows that applying an eventually decaying learning rate in gradient descent for two-layer neural networks ensures a small Lipschitz constant without harming convergence. This leads to generalization bounds that depend only sub-linearly on the number of trainable parameters, with validation from toy experiments.

We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on their number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
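As a rough illustration of the setup the abstract describes, the following NumPy sketch trains a two-layer tanh network with full-batch GD on the MSE loss, using a learning rate that is constant at first and decays afterwards. The specific schedule (`lr_schedule`, with `base_lr`, `decay_start`, and `power`), the initialization, and all function names are assumptions made for this example and are not taken from the paper. The quantity returned by `lipschitz_upper_bound` is the standard bound, the Euclidean norm of `a` times the spectral norm of `W`, which is valid for any 1-Lipschitz activation such as tanh; it is not necessarily the constant analyzed by the authors.

```python
import numpy as np

def lr_schedule(step, base_lr=0.05, decay_start=200, power=1.0):
    """Hypothetical 'eventual decay': constant LR, then polynomial decay."""
    if step < decay_start:
        return base_lr
    return base_lr / (1.0 + (step - decay_start)) ** power

def forward(params, X):
    """Two-layer network f(x) = a^T tanh(W x + b); returns outputs and hidden layer."""
    W, b, a = params
    H = np.tanh(X @ W.T + b)          # shape (n, width)
    return H @ a, H

def mse(params, X, y):
    pred, _ = forward(params, X)
    return 0.5 * np.mean((pred - y) ** 2)

def grads(params, X, y):
    """Analytic gradients of the MSE loss with respect to (W, b, a)."""
    W, b, a = params
    pred, H = forward(params, X)
    r = (pred - y) / X.shape[0]       # dLoss/dpred, shape (n,)
    grad_a = H.T @ r
    dZ = np.outer(r, a) * (1.0 - H ** 2)   # backprop through tanh
    return dZ.T @ X, dZ.sum(axis=0), grad_a

def lipschitz_upper_bound(params):
    """Upper bound ||a||_2 * ||W||_op, valid because tanh is 1-Lipschitz."""
    W, _, a = params
    return np.linalg.norm(a) * np.linalg.norm(W, 2)

def train(X, y, width=64, steps=500, seed=0, schedule=lr_schedule):
    """Full-batch GD; the initialization scaling is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    params = (
        rng.normal(size=(width, d)) / np.sqrt(d),
        np.zeros(width),
        rng.normal(size=width) / np.sqrt(width),
    )
    for t in range(steps):
        lr = schedule(t)
        params = tuple(p - lr * g for p, g in zip(params, grads(params, X, y)))
    return params
```

To compare against constant step size GD, as in the experiments the abstract mentions, one could pass `schedule=lambda t: 0.05` to `train` and inspect `mse` and `lipschitz_upper_bound` for both runs.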

