LG AI · Feb 8, 2022

A Mini-Block Fisher Method for Deep Neural Networks

arXiv:2202.04124v4 · 8.711 citations · h-index: 49

Originality: Incremental advance

AI Analysis

This work addresses the need for more efficient second-order optimization methods in deep learning, though it is incremental as it builds on existing block-diagonal approximations.

The authors address the problem of efficiently incorporating curvature information into deep neural network training by proposing a mini-block Fisher (MBF) preconditioned gradient method. Its per-iteration computational cost is only slightly higher than that of first-order methods, and the authors report that it is effective in terms of both time efficiency and generalization on autoencoder and CNN tasks.

Deep neural networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs by preconditioning the stochastic gradient with layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than that of first-order methods. The performance of our proposed method is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness in terms of both time efficiency and generalization power. Finally, it is proved that an idealized version of MBF converges linearly.
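The mini-block idea described in the abstract can be illustrated with a small NumPy sketch for a single fully connected layer. This is not the authors' implementation. It assumes, purely for illustration, one mini-block per row of the weight matrix, per-example gradients that are already available, and a simple additive damping term. The function name `mbf_precondition` and the `damping` parameter are hypothetical.

```python
import numpy as np

def mbf_precondition(per_example_grads, damping=1e-3):
    """Precondition a layer gradient with mini-block empirical Fisher blocks.

    per_example_grads: array of shape (batch, n_out, n_in) holding the
    gradient of the loss w.r.t. the layer weights for each example.
    Assumption (illustrative): one mini-block per output row.
    Returns an array of shape (n_out, n_in).
    """
    g = np.asarray(per_example_grads, dtype=np.float64)
    batch, n_out, n_in = g.shape

    # Mini-batch gradient of the layer, shape (n_out, n_in).
    mean_grad = g.mean(axis=0)

    # Empirical Fisher mini-blocks: for each row k, the average outer
    # product of that row's per-example gradients. Shape (n_out, n_in, n_in).
    blocks = np.einsum("bki,bkj->kij", g, g) / batch

    # Damping keeps every mini-block invertible (hypothetical choice).
    blocks = blocks + damping * np.eye(n_in)

    # One batched solve over all mini-blocks instead of a Python loop;
    # on a GPU array library the same pattern runs the solves in parallel.
    direction = np.linalg.solve(blocks, mean_grad[..., None])[..., 0]
    return direction
```

A weight update would then take a step along the negative of the returned direction. Because each mini-block is only `n_in × n_in` and all blocks are solved in one batched call, the extra work over a first-order step stays modest as long as the blocks are small.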
