LG OC

Sep 29, 2022

Restricted Strong Convexity of Deep Learning Models with Smooth Activations

Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

arXiv:2209.15106v1

10.411 citations

h-index: 55

Originality: Incremental advance

AI Analysis

This work provides an alternative sufficient condition for convergence in deep learning optimization. The advance is incremental, but it addresses a known bottleneck in theoretical analysis for researchers in machine learning theory.

The paper tackles the optimization of deep learning models with smooth activation functions by establishing a sharper upper bound on the Hessian's spectral norm and introducing a Restricted Strong Convexity (RSC) analysis that guarantees geometric convergence for gradient descent without relying on the Neural Tangent Kernel (NTK). It presents theoretical results with bounds scaling as O(poly(L)/√m) and preliminary experimental support.

We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the "near initialization" perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, $m$ width, and $\sigma_0^2$ initialization variance. First, for suitable $\sigma_0^2$, we establish a $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC-based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models, thus becoming an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.
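
For readers unfamiliar with the term, "geometric convergence" means the loss shrinks by at least a fixed factor at every GD step. The sketch below illustrates this on a toy quadratic that is globally strongly convex, which is a much stronger assumption than the paper's restricted condition; it is not the paper's model, loss, or proof technique. The function name `gd_losses`, the curvature values, and the step size `1 / beta` are all assumptions made for illustration.

```python
def gd_losses(curvatures, theta0, steps):
    """Run GD on the toy loss f(theta) = 0.5 * sum(a_i * theta_i**2).

    Hypothetical helper for illustration only. With all a_i > 0 the loss is
    alpha-strongly convex and beta-smooth, where alpha = min(a_i) and
    beta = max(a_i). Returns the loss after each step and the textbook
    contraction factor 1 - alpha / beta for step size 1 / beta.
    """
    alpha = min(curvatures)  # strong-convexity constant of the toy loss
    beta = max(curvatures)   # smoothness constant of the toy loss
    eta = 1.0 / beta         # assumed step size
    theta = list(theta0)

    def loss(th):
        return 0.5 * sum(a * x * x for a, x in zip(curvatures, th))

    losses = [loss(theta)]
    for _ in range(steps):
        # The gradient of the toy loss is a_i * theta_i in each coordinate.
        theta = [x - eta * a * x for a, x in zip(curvatures, theta)]
        losses.append(loss(theta))
    return losses, 1.0 - alpha / beta


if __name__ == "__main__":
    losses, rate = gd_losses([1.0, 2.0, 4.0], [1.0, 1.0, 1.0], steps=10)
    for t in range(len(losses) - 1):
        # Each ratio should stay at or below the contraction factor.
        print(t, losses[t + 1] / losses[t], "<=", rate)
```

In this toy case every per-step loss ratio stays at or below `1 - alpha / beta`, which is the kind of per-step contraction that a geometric convergence guarantee describes. The paper's contribution is to obtain such a guarantee for deep networks under an RSC condition rather than global strong convexity.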
