Global Convergence of Gradient Descent for Deep Linear Residual Networks
This work addresses the difficulty of optimizing deep linear neural networks at large depth by showing how the residual structure and the initialization affect convergence. The contribution is incremental in that it builds on prior work on initialization methods.
The paper studies the optimization of deep linear residual networks and proposes a zero-asymmetric (ZAS) initialization designed to avoid the stable manifolds of saddle points. It proves that, under this initialization, gradient descent converges to an ε-optimal point in O(L³ log(1/ε)) iterations, which is polynomial in the depth L. By contrast, the standard initialization (Xavier or near-identity) requires exp(Ω(L)) iterations.
We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O(L^3 \log(1/\varepsilon))$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the standard initialization (Xavier or near-identity) [Shamir, 2018] together demonstrate the importance of the residual structure and the initialization for the optimization of deep linear neural networks, especially when $L$ is large.
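To make the setting concrete, the following is a minimal NumPy sketch of gradient descent on a deep linear residual network. The abstract does not define the model or the ZAS initialization, so everything here is an assumption for illustration: the network is taken to be $U (I + W_L) \cdots (I + W_1)$, the loss is the squared Frobenius distance to the target matrix, and ZAS is taken to mean that every residual block $W_l$ and the output layer $U$ start at zero. The function names `chain` and `train_zas`, the step size, and the step count are hypothetical choices, not values from the paper.

```python
import numpy as np


def chain(mats, d):
    """Return mats[-1] @ ... @ mats[0], or the d x d identity for an empty list."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out


def train_zas(target, depth, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * ||U (I + W_L) ... (I + W_1) - target||_F^2."""
    d_out, d = target.shape
    eye = np.eye(d)
    # ASSUMED form of ZAS: residual blocks and the output layer all start at zero,
    # so the initial end-to-end map is the zero matrix.
    W = [np.zeros((d, d)) for _ in range(depth)]
    U = np.zeros((d_out, d))
    losses = []
    for _ in range(steps):
        layers = [eye + w for w in W]
        A = chain(layers, d)
        R = U @ A - target  # residual of the end-to-end linear map
        losses.append(0.5 * np.sum(R ** 2))
        grad_U = R @ A.T
        grad_W = []
        for l in range(depth):
            below = chain(layers[:l], d)          # product of the layers before block l
            above = U @ chain(layers[l + 1:], d)  # U times the layers after block l
            grad_W.append(above.T @ R @ below.T)
        U = U - lr * grad_U
        W = [w - lr * g for w, g in zip(W, grad_W)]
    R = U @ chain([eye + w for w in W], d) - target
    losses.append(0.5 * np.sum(R ** 2))
    return U, W, losses


# Hypothetical target with a negative diagonal entry, chosen only for illustration.
target = np.array([[1.0, 0.5, 0.0],
                   [0.0, -1.0, 0.3],
                   [0.2, 0.0, 0.8]])
U, W, losses = train_zas(target, depth=4)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.2e}")
```

Under this assumed initialization the first step moves only $U$, because the gradient with respect to each $W_l$ is proportional to $U^\top$ and therefore vanishes at $U = 0$. The residual blocks start to move from the second step onward. The sketch does not attempt to reproduce the $O(L^3 \log(1/\varepsilon))$ bound or the step size required by the theorem.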