Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
This reveals a potential obstacle for training deep linear networks, which is incremental as it builds on prior convergence studies.
The paper tackles the problem of gradient descent convergence for deep linear neural networks, proving that the number of iterations scales exponentially with network depth under random initializations and mild assumptions, and empirically shows this phenomenon extends to higher dimensions.
We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth $k$. We also show empirically that this phenomenon can occur in higher dimensions, where each $w_i$ is a matrix. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where $k$ is large.