Gradient Descent Happens in a Tiny Subspace
This insight could impact optimization and learning in deep learning by revealing underlying dynamics, though it appears incremental as it builds on existing understanding of gradient behavior.
The paper demonstrates that in large-scale deep learning, gradients converge to a small subspace spanned by top Hessian eigenvectors, which is preserved over time, suggesting gradient descent occurs primarily in this subspace, with an example provided in a solvable classification model.
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.