LG MLJan 24, 2019

Width Provably Matters in Optimization for Deep Linear Neural Networks

arXiv:1901.08572v324.4104 citations

Originality Incremental advance

AI Analysis

This addresses the optimization difficulty in deep learning by providing theoretical guarantees on convergence for wide networks, which is foundational for understanding model design.

The paper proves that for deep linear neural networks, if hidden layer widths exceed a polynomial bound depending on rank, condition number, and output dimension, gradient descent converges linearly to a global minimum, requiring O(κ log(1/ε)) iterations to find an ε-suboptimal solution, and contrasts this with an exponential lower bound for narrow networks to show width is necessary.

We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\tildeΩ(L \cdot r \cdot d_{\mathrm{out}} \cdot κ^3 )$, where $r$ and $κ$ are the rank and the condition number of the input data, and $d_{\mathrm{out}}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $ε$-suboptimal solution is $O(κ\log(\frac{1}ε))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(Ω\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.

View on arXiv PDF

Similar