LG ML · Mar 14, 2019

Inefficiency of K-FAC for Large Batch Size Training

Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney

arXiv:1903.06237v312.224 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses inefficient large-batch training, a practical problem for machine learning practitioners, and shows that proposed remedies such as K-FAC offer at most incremental gains and are less effective than hoped.

The paper investigates the scalability of K-FAC and SGD for large batch size training in neural networks, finding that K-FAC does not improve scalability compared to SGD and both methods show diminishing returns beyond a critical batch size, with K-FAC also requiring more hyperparameter tuning.

In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns beyond a certain critical batch size. In the hopes of addressing this, it has been suggested that the Kronecker-Factored Approximate Curvature (K-FAC) method allows for greater scalability to large batch sizes for non-convex machine learning problems such as neural network optimization, as well as greater robustness to variation in model hyperparameters. Here, we perform a detailed empirical analysis of large batch size training for both K-FAC and SGD, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that neither K-FAC nor SGD exhibits ideal scalability behavior beyond a certain batch size, and that K-FAC does not exhibit improved large-batch scalability behavior compared to SGD; and second, we find that K-FAC, in addition to requiring more hyperparameters to tune, suffers from hyperparameter sensitivity similar to that of SGD. We discuss extensive results using ResNet and AlexNet on CIFAR-10 and SVHN, respectively, as well as more general implications of our findings.
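For readers unfamiliar with the method named in the abstract, the following is a minimal NumPy sketch of the Kronecker-factored preconditioning step that K-FAC applies to a single fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `kfac_precondition`, the argument names, and the simple additive damping are all hypothetical choices, and the output gradients are taken as given rather than sampled from the model's predictive distribution as in the full method.

```python
import numpy as np


def kfac_precondition(grad_w, acts, out_grads, damping=1e-3):
    """Sketch of a K-FAC-style preconditioned gradient for one dense layer.

    grad_w    : (out, in) gradient of the loss w.r.t. the weight matrix
    acts      : (batch, in) layer inputs
    out_grads : (batch, out) gradients w.r.t. the layer's pre-activations
    damping   : assumed additive damping applied to both factors
    """
    n = acts.shape[0]
    # Kronecker factors: second moments of inputs and of output gradients.
    A = acts.T @ acts / n              # (in, in)
    G = out_grads.T @ out_grads / n    # (out, out)
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    # Compute G_d^{-1} @ grad_w @ A_d^{-1} without forming explicit inverses.
    left = np.linalg.solve(G_d, grad_w)        # G_d^{-1} grad_w
    # A_d is symmetric, so left @ A_d^{-1} == (A_d^{-1} left^T)^T.
    return np.linalg.solve(A_d, left.T).T
```

The point of the factorization is cost: the two small solves above stand in for inverting the full (in·out) × (in·out) matrix `np.kron(A_d, G_d)`, which is what makes the curvature approximation tractable per layer. The damping value is one of the additional hyperparameters that a K-FAC-style optimizer introduces beyond those of SGD.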
