LG, CV
Mar 6, 2024

Inverse-Free Fast Natural Gradient Descent Method for Deep Learning

arXiv:2403.03473v2
h-index: 17
AI Analysis

This paper addresses slow convergence in deep learning optimization by offering practitioners a more efficient second-order method. The contribution is incremental: it builds on existing natural gradient descent techniques.

The paper tackles the computational inefficiency of second-order optimization in deep learning by proposing a fast natural gradient descent (FNGD) method that avoids repeated matrix inversions. It reports a 2.07x speedup over KFAC on ResNet-18/CIFAR-100 and a 24-point BLEU improvement over AdamW on Transformer/Multi30K at similar training time.

Second-order optimization techniques have the potential to achieve faster convergence rates compared to first-order methods through the incorporation of second-order derivatives or statistics. However, their utilization in deep learning is limited due to their computational inefficiency. Various approaches have been proposed to address this issue, primarily centered on minimizing the size of the matrix to be inverted. Nevertheless, the necessity of performing the inverse operation iteratively persists. In this work, we present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch. Specifically, it is revealed that natural gradient descent (NGD) is essentially a weighted sum of per-sample gradients. Our novel approach further proposes to share these weighting coefficients across epochs without affecting empirical performance. Consequently, FNGD resembles the averaged sum used in first-order methods, so its computational complexity is comparable to that of first-order methods. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD can achieve a speedup of 2.07x compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU points while requiring almost the same training time.
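
The claim that NGD is a weighted sum of per-sample gradients can be illustrated with a small NumPy sketch. It rests on assumptions that are not taken from the paper: an empirical Fisher matrix built from per-sample gradients, a scalar damping term, and a single flat parameter vector. The function names (`ngd_direct`, `ngd_coefficients`, `weighted_sum`) are hypothetical, and the derivation uses the standard push-through identity, which may differ from the authors' own derivation and from their per-layer implementation.

```python
import numpy as np

def ngd_direct(U, damping):
    """Damped natural gradient computed with a P x P solve.

    U holds one per-sample gradient per row (shape N x P).
    """
    n, p = U.shape
    fisher = U.T @ U / n              # empirical Fisher (assumption of this sketch)
    grad = U.mean(axis=0)             # ordinary averaged gradient
    return np.linalg.solve(fisher + damping * np.eye(p), grad)

def ngd_coefficients(U, damping):
    """Per-sample weights w such that U.T @ w equals ngd_direct(U, damping).

    Follows from (U^T U / n + d I)^-1 U^T = U^T (U U^T / n + d I)^-1,
    so only an N x N system has to be solved.
    """
    n = U.shape[0]
    gram = U @ U.T / n
    return np.linalg.solve(gram + damping * np.eye(n), np.ones(n)) / n

def weighted_sum(U, w):
    """Combine per-sample gradients with fixed coefficients; no inversion."""
    return U.T @ w

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 20))          # 8 samples, 20 parameters (toy sizes)
damping = 0.1

w = ngd_coefficients(U, damping)
direct = ngd_direct(U, damping)
via_weights = weighted_sum(U, w)
print(np.allclose(direct, via_weights))

# Reusing w on later gradients costs one matrix-vector product,
# which is the idea of sharing coefficients across epochs.
U_later = rng.normal(size=(8, 20))
shared_update = weighted_sum(U_later, w)

# Uniform weights 1/N reduce the weighted sum to the plain averaged gradient.
uniform = np.full(8, 1.0 / 8)
print(np.allclose(weighted_sum(U, uniform), U.mean(axis=0)))
```

In this toy setting the only difference between the natural gradient and the averaged first-order gradient is the choice of coefficients, which is why reusing them after the first epoch brings the per-step cost close to that of a first-order method.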
