On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width
The paper offers theoretical insight into optimization dynamics that may interest deep learning practitioners, but the contribution is incremental: it builds on existing Hessian analysis and introduces neither new methods nor broad applications.
The paper investigates the Hessian structure of wide neural networks. It observes that, during gradient descent training, the loss surface has positive curvature at the start and end and near-zero curvature in between, which suggests that the Hessian is dominated by its generalized Gauss-Newton component G. To explain this, it shows that under common initialization schemes the gradients of over-parameterized networks are approximately orthogonal to the remaining component H, which carries the negative eigenvalues, so the curvature in the gradient direction is positive.
The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H containing negative eigenvalues. We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between. In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G. To explain this phenomenon, we show that when initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to H, such that the curvature of the loss surface is strictly positive in the direction of the gradient.
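
To make the decomposition concrete, the sketch below splits the curvature along the gradient g into its two parts, g^T (G + H) g = g^T G g + g^T H g, for a toy model. Everything about the setup is an assumption made for illustration and is not taken from the paper: a one-hidden-layer tanh network with 1/sqrt(width) output scaling, Gaussian initialization, a squared loss (for which G reduces to J^T J / n, with J the Jacobian of the outputs), random data, and finite-difference derivatives. The function names (`curvature_split`, `hessian_fd`, `jacobian_fd`) are hypothetical helpers.

```python
import numpy as np

def model(theta, X, d_in, width):
    # Toy network (assumption): f(x) = v . tanh(W x) / sqrt(width)
    W = theta[: width * d_in].reshape(width, d_in)
    v = theta[width * d_in:]
    return np.tanh(X @ W.T) @ v / np.sqrt(width)

def loss(theta, X, y, d_in, width):
    # Squared loss: L = 1/(2n) * sum_i (f_i - y_i)^2
    r = model(theta, X, d_in, width) - y
    return 0.5 * np.mean(r ** 2)

def jacobian_fd(theta, X, d_in, width, eps=1e-5):
    # Central-difference Jacobian of the outputs w.r.t. the parameters.
    J = np.zeros((X.shape[0], theta.size))
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        J[:, k] = (model(theta + e, X, d_in, width)
                   - model(theta - e, X, d_in, width)) / (2 * eps)
    return J

def hessian_fd(f, theta, eps=1e-4):
    # Central-difference Hessian of a scalar function f at theta.
    p = theta.size
    step = np.eye(p) * eps
    hess = np.zeros((p, p))
    for i in range(p):
        for j in range(i, p):
            val = (f(theta + step[i] + step[j]) - f(theta + step[i] - step[j])
                   - f(theta - step[i] + step[j]) + f(theta - step[i] - step[j]))
            hess[i, j] = hess[j, i] = val / (4 * eps ** 2)
    return hess

def curvature_split(width, d_in=3, n=8, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d_in))          # random inputs (assumption)
    y = rng.normal(size=n)                  # random targets (assumption)
    W0 = rng.normal(scale=1.0 / np.sqrt(d_in), size=(width, d_in))
    v0 = rng.normal(size=width)
    theta = np.concatenate([W0.ravel(), v0])

    J = jacobian_fd(theta, X, d_in, width)
    r = model(theta, X, d_in, width) - y
    g = J.T @ r / n                         # gradient of the squared loss
    G = J.T @ J / n                         # Gauss-Newton part (PSD)
    hess = hessian_fd(lambda t: loss(t, X, y, d_in, width), theta)
    H = hess - G                            # remainder, may be indefinite

    gg = g @ g
    return {
        "total": g @ hess @ g / gg,         # curvature along the gradient
        "G_part": g @ G @ g / gg,           # contribution of G
        "H_part": g @ H @ g / gg,           # contribution of H
        "min_eig_G": np.linalg.eigvalsh(G).min(),
    }

if __name__ == "__main__":
    for width in (2, 8, 32):
        print(width, curvature_split(width))
```

Running the script prints the three curvature terms at initialization for a few widths, so the relative size of the H contribution can be compared against the G contribution. Because the Hessian here is built by finite differences, the sketch is only practical for very small parameter counts.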