On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width
This work addresses the efficient training of larger-scale models, a problem relevant to researchers and practitioners in machine learning. Its contribution is incremental in that it builds on existing second-order optimization methods such as K-FAC and Shampoo.
The study tackles the challenge of scaling second-order optimization for deep neural networks by identifying a specific parameterization that keeps feature learning stable as network width increases. This parameterization achieves higher generalization performance and enables hyperparameter transfer across models of different widths.
Second-order optimization has been developed to accelerate the training of deep neural networks, and it is being applied to increasingly large models. In this study, with the aim of training at even larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even when the network width increases significantly. Inspired by the maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of the hyperparameters, including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer the hyperparameters across models with different widths.
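To make the quantities named above concrete, the following is a minimal NumPy sketch of a single damped K-FAC update for one linear layer, with the initialization scale, learning rate, and damping each written as a base value times a power of the width. It uses the standard Kronecker-factored form of K-FAC rather than anything specific to this work. The helper names `scaled` and `kfac_one_step` and the entries of `EXPONENTS` are hypothetical placeholders chosen for illustration; the exponents are not the values derived in the paper.

```python
import numpy as np


def scaled(base, width, exponent):
    """Scale a base hyperparameter as base * width**(-exponent)."""
    return base * float(width) ** (-exponent)


def kfac_one_step(W, a, delta, lr, damping_A, damping_B):
    """One damped K-FAC step for a linear layer with outputs u = a @ W.T.

    W:     (d_out, d_in) weight matrix
    a:     (n, d_in) layer inputs over a batch
    delta: (n, d_out) gradients of the loss w.r.t. the layer outputs
    """
    n = a.shape[0]
    grad = delta.T @ a / n           # batch-averaged gradient, (d_out, d_in)
    A = a.T @ a / n                  # input second-moment factor, (d_in, d_in)
    B = delta.T @ delta / n          # output-gradient factor, (d_out, d_out)
    A_d = A + damping_A * np.eye(A.shape[0])
    B_d = B + damping_B * np.eye(B.shape[0])
    # Preconditioned gradient B_d^{-1} @ grad @ A_d^{-1}, computed with
    # linear solves; the second solve relies on A_d being symmetric.
    left = np.linalg.solve(B_d, grad)
    precond = np.linalg.solve(A_d, left.T).T
    return W - lr * precond


# Placeholder exponents for illustration only; NOT the paper's derived values.
EXPONENTS = {"init_std": 0.5, "lr": 0.0, "damping": 0.0}

rng = np.random.default_rng(0)
width, batch = 64, 32
W = rng.normal(scale=scaled(1.0, width, EXPONENTS["init_std"]), size=(width, width))
a = rng.normal(size=(batch, width))
delta = rng.normal(size=(batch, width))

lr = scaled(0.1, width, EXPONENTS["lr"])
damping = scaled(0.1, width, EXPONENTS["damping"])
W_new = kfac_one_step(W, a, delta, lr, damping, damping)
```

Writing each hyperparameter as a base value times a width-dependent factor is what makes transfer across widths possible in principle: the base values are tuned once on a narrow model, and only `width` changes when the model is widened.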