LG Sep 24, 2023

Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

Guo-qing Jiang, Jinlong Liu, Zixiang Ding, Lin Guo, Wei Lin

arXiv:2309.13681v16.62 citations h-index: 25

Originality Highly original

AI Analysis

This work addresses a critical bottleneck in scaling deep learning training for large models, offering significant improvements in efficiency and accuracy for practitioners in NLP, CV, and recommendation systems.

The paper targets the large generalization gap and reduced accuracy seen in large-batch training for NLP, CV, and recommendation systems. It develops a variance reduced gradient descent technique (VRGD) based on the gradient signal-to-noise ratio, which accelerates training by 1-2x, pushes batch size limits to 128k/64k for BERT pretraining and 512k for DLRM without noticeable accuracy loss, and reduces the generalization gap of BERT and ImageNet training by over 65%.

As models for natural language processing (NLP), computer vision (CV) and recommendation systems (RS) require surging computation, a large number of GPUs/TPUs are run in parallel with a large batch (LB) to improve training throughput. However, training such LB tasks often suffers from a large generalization gap and lower final precision, which limits how far the batch size can be enlarged. In this work, we develop the variance reduced gradient descent technique (VRGD) based on the gradient signal to noise ratio (GSNR) and apply it to popular optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of the convergence rate to explain its fast training dynamics, and a generalization analysis to demonstrate its smaller generalization gap in LB training. Comprehensive experiments demonstrate that VRGD can accelerate training ($1\sim 2 \times$), narrow the generalization gap and improve final accuracy. We push the batch size limit of BERT pretraining up to 128k/64k and of DLRM to 512k without noticeable accuracy loss. We improve ImageNet Top-1 accuracy at a batch size of 96k by $0.52$ pp over LARS. The generalization gap of BERT and ImageNet training is significantly reduced, by over $65\%$.
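The abstract does not spell out the update rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes GSNR is the element-wise squared mean of the gradient divided by its variance, estimated here from the gradients reported by each device, and it assumes the ratio is normalised and clipped before scaling a plain SGD step. The function names `gsnr` and `vrgd_sgd_step`, the `floor` parameter, and the normalisation scheme are all hypothetical.

```python
from statistics import fmean, pvariance


def gsnr(per_device_grads, eps=1e-12):
    """Return (mean gradient, GSNR) per parameter.

    per_device_grads: list of k gradient vectors, one per device,
    each a list of floats of equal length.
    Assumed definition: GSNR = mean(g)^2 / var(g), element-wise.
    """
    columns = list(zip(*per_device_grads))  # the k estimates for each parameter
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    ratios = [m * m / (v + eps) for m, v in zip(means, variances)]
    return means, ratios


def vrgd_sgd_step(params, per_device_grads, lr=0.1, floor=0.1, eps=1e-12):
    """One SGD step with a GSNR-scaled, element-wise learning rate.

    Assumption (not from the source): ratios are divided by their mean
    and clipped to [floor, 1], so noisy parameters take smaller steps.
    """
    means, ratios = gsnr(per_device_grads, eps)
    avg = fmean(ratios)
    scales = [min(1.0, max(floor, r / (avg + eps))) for r in ratios]
    return [p - lr * s * g for p, s, g in zip(params, scales, means)]


# Two devices agree on the first parameter and disagree on the second.
grads = [[1.0, 3.0], [1.0, -1.0]]
new_params = vrgd_sgd_step([0.0, 0.0], grads)
```

In this toy case both parameters have the same mean gradient, but the second has high variance across devices, so its step is scaled down to the floor while the first takes the full step.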
