On the Power-Law Hessian Spectrums in Deep Learning
This work addresses the theoretical understanding of Hessian spectra in deep learning, which is crucial for optimization and generalization; its contribution is incremental in that it builds on prior empirical discoveries.
The paper demonstrates that the Hessian spectra of well-trained deep neural networks exhibit power-law structures, providing a maximum-entropy theoretical interpretation inspired by statistical physics and protein evolution, and uses this framework to explore novel behaviors in deep learning.
It is well known that the Hessian of the deep loss landscape matters to the optimization, generalization, and even robustness of deep learning. Recent works empirically discovered that the Hessian spectrum in deep learning has a two-component structure consisting of a small number of large eigenvalues and a large number of nearly-zero eigenvalues. However, the theoretical mechanism or mathematical structure behind the Hessian spectrum is still largely under-explored. To the best of our knowledge, we are the first to demonstrate that the Hessian spectra of well-trained deep neural networks exhibit simple power-law structures. Inspired by statistical physics theories and the spectral analysis of natural proteins, we provide a maximum-entropy theoretical interpretation of why the power-law structure exists and suggest a spectral parallel between protein evolution and the training of deep neural networks. By conducting extensive experiments, we further use the power-law spectral framework as a useful tool to explore multiple novel behaviors of deep learning.
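As a rough illustration of what a power-law spectrum means in practice, the sketch below fits a rank-ordered power law to a set of eigenvalues by linear regression in log-log space. The rank-based form lambda_k ≈ lambda_1 * k^(-s), the function name `fit_power_law`, and the `top_k` parameter are assumptions made for this example and are not taken from the abstract; the eigenvalues are synthetic rather than computed from a trained network.

```python
import numpy as np


def fit_power_law(eigenvalues, top_k=None):
    """Fit lambda_k ~ lambda_1 * k**(-s) to rank-ordered eigenvalues.

    Assumed form for illustration only. Returns (s, lambda_1) estimated
    by least squares on log(lambda_k) versus log(k).
    """
    # Sort in descending order so that rank 1 is the largest eigenvalue.
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    if top_k is not None:
        lam = lam[:top_k]
    # Logarithms require strictly positive values; drop the rest.
    lam = lam[lam > 0]
    ranks = np.arange(1, lam.size + 1, dtype=float)
    # A power law is a straight line in log-log coordinates.
    slope, intercept = np.polyfit(np.log(ranks), np.log(lam), 1)
    return -slope, float(np.exp(intercept))


# Synthetic spectrum that follows an exact power law (hypothetical values).
ranks = np.arange(1, 201, dtype=float)
synthetic = 5.0 * ranks ** -1.2

s_hat, lam1_hat = fit_power_law(synthetic)
print(f"estimated exponent s = {s_hat:.3f}, lambda_1 = {lam1_hat:.3f}")
```

For a real network, the top eigenvalues would first have to be estimated, for example with an iterative method based on Hessian-vector products, and a log-log fit of this kind is only a quick diagnostic rather than a rigorous test of power-law behavior.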