Information-Theoretic Perspectives on Optimizers
This work addresses the complex interplay between optimizers and architectures in neural networks, offering incremental insights through information-theoretic analysis.
The authors study why certain optimizers perform better on specific neural network architectures. They introduce an information-theoretic metric, the entropy gap, and find that it affects optimization dynamics and generalization alongside the traditional sharpness metric. They then apply these tools to analyze and improve the Lion optimizer.
The interplay between optimizers and architectures in neural networks is complicated, and it is hard to understand why some optimizers work better on specific architectures. In this paper, we find that the traditionally used sharpness metric does not fully explain this interplay, and we introduce an information-theoretic metric, the entropy gap, to aid the analysis. We find that both sharpness and the entropy gap affect performance, including optimization dynamics and generalization. We further use information-theoretic tools to understand a recently proposed optimizer, Lion, and find ways to improve it.
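For readers unfamiliar with Lion, the following is a minimal sketch of its update rule as it is commonly described: the step direction is the sign of an interpolation between the momentum and the current gradient, and the momentum is then updated with a second coefficient. The function name `lion_step`, the list-of-floats parameter layout, and the default hyperparameter values are assumptions made for illustration. This sketch does not reproduce the improved variant or the entropy gap computation from the paper, neither of which is specified in the text above.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One Lion-style update over flat lists of scalar parameters.

    Hypothetical helper for illustration; hyperparameter defaults are
    assumptions, not values taken from the paper summarized above.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Direction: sign of the interpolation between momentum and gradient.
        direction = sign(beta1 * m + (1.0 - beta1) * g)
        # Every coordinate moves by the same magnitude lr (plus decoupled decay).
        new_params.append(p - lr * (direction + weight_decay * p))
        # Momentum is tracked with a separate coefficient beta2.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


# Usage: one step on f(x) = x0^2 + x1^2, whose gradient is 2 * x.
params = [1.0, -2.0]
grads = [2.0 * p for p in params]
params, momentum = lion_step(params, grads, [0.0, 0.0], lr=0.1)
```

Because the update passes through `sign`, each coordinate changes by exactly `lr` regardless of the gradient's scale. This makes the update highly quantized, which is one reason an information-theoretic view of Lion is a natural fit.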