LG · Jun 10, 2025

MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

arXiv:2506.08464v2 · 2 citations · h-index: 1 · ICDM
Originality: Incremental advance
AI Analysis

The paper addresses the computational bottleneck of training neural networks with second-order optimization, particularly for transformers, though it appears to be an incremental improvement over KFAC.

The paper tackles the high computational cost of second-order optimization methods like KFAC by proposing MAC (Mean Activation Approximated Curvature), an efficient gradient preconditioning method built on cheaper approximations of the Kronecker factors of the layer-wise Fisher information matrix. According to the authors' evaluations, the method outperforms KFAC and other state-of-the-art methods in accuracy, training time, and memory usage across various architectures and datasets.

Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of the loss landscape. However, this comes at the expense of a high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and to pre-activation gradients. Based on empirical observations of their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and to explicitly integrate attention scores into the preconditioning. We also study the convergence properties of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage.
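To make the abstract's two Kronecker factors concrete, the following is a minimal NumPy sketch for a single linear layer. The first function is the standard damped KFAC preconditioning step. The second is only one plausible reading of "mean activation approximated curvature": it assumes the activation factor is replaced by the rank-one outer product of the batch-mean activation and, purely as a placeholder, that the pre-activation gradient factor is replaced by the identity. Neither assumption is confirmed by this summary, and the function names, shapes, and damping value are illustrative.

```python
import numpy as np


def kfac_precondition(grad_w, acts, pre_grads, damping=1e-2):
    """Standard KFAC step for one linear layer: G^{-1} @ grad_w @ A^{-1}.

    grad_w:    (d_out, d_in) gradient of the loss w.r.t. the weights
    acts:      (n, d_in) layer inputs (activations) for a batch
    pre_grads: (n, d_out) gradients w.r.t. the layer's pre-activations
    """
    n = acts.shape[0]
    # Damped Kronecker factors: A ~ E[a a^T], G ~ E[g g^T].
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])
    G = pre_grads.T @ pre_grads / n + damping * np.eye(pre_grads.shape[1])
    # Solve linear systems rather than forming explicit inverses.
    left = np.linalg.solve(G, grad_w)       # G^{-1} grad_w
    return np.linalg.solve(A, left.T).T     # (G^{-1} grad_w) A^{-1}, A symmetric


def mean_activation_precondition(grad_w, acts, damping=1e-2):
    """Hypothetical rank-one variant (an assumption, not the paper's algorithm).

    Assumes A ~ a_bar a_bar^T + damping * I and G ~ I, so the inverse of A
    follows from the Sherman-Morrison formula without any matrix solve.
    """
    a_bar = acts.mean(axis=0)               # (d_in,) batch-mean activation
    coeff = 1.0 / (damping + a_bar @ a_bar)
    # grad_w @ (I - coeff * a_bar a_bar^T) / damping
    return (grad_w - coeff * np.outer(grad_w @ a_bar, a_bar)) / damping


# Toy usage with random data (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 5))
pre_grads = rng.normal(size=(32, 3))
grad_w = pre_grads.T @ acts / 32
print(kfac_precondition(grad_w, acts, pre_grads).shape)
print(mean_activation_precondition(grad_w, acts).shape)
```

Under these assumptions, the rank-one variant needs only a mean, a dot product, and an outer product per layer, whereas KFAC solves two dense linear systems. This is the kind of saving in time and memory the abstract refers to, though the paper's actual approximations may differ.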

Code Implementations: 1 repo
