Layer-wise Quantization for Quantized Optimistic Dual Averaging
This work addresses efficiency challenges in distributed training for deep learning, using layer-wise quantization to handle heterogeneity across layers.
The paper tackles the problem of heterogeneous layers in deep neural networks by developing a layer-wise quantization framework with tight bounds and applying it to distributed variational inequalities, resulting in a novel Quantized Optimistic Dual Averaging algorithm that achieves up to a 150% speedup in training time for Wasserstein GAN on 12+ GPUs.
Modern deep neural networks are heterogeneous across their many layers of various types (residual, multi-head attention, etc.): layers differ in structure (dimensions, activation functions, etc.) and in representation characteristics, both of which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds that adapts to these heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time for Wasserstein GAN on $12+$ GPUs.
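To make the idea of layer-wise quantization concrete, the following is a minimal sketch, not the paper's actual scheme. It assumes a simple unbiased stochastic quantizer with uniformly spaced levels (in the style of QSGD) and gives each layer its own number of levels. The function names `quantize_layer` and `quantize_layerwise`, the uniform level placement, and the L2-norm scaling are all illustrative assumptions; the abstract does not specify how levels are chosen or adapted during training.

```python
import math
import random


def quantize_layer(values, num_levels, rng):
    """Stochastically round one layer's values onto a uniform grid.

    Assumption (hypothetical scheme): magnitudes are normalized by the
    layer's L2 norm and rounded to one of `num_levels` + 1 uniform levels
    in [0, 1]. Rounding up happens with probability equal to the
    fractional part, so the quantizer is unbiased in expectation.
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return [0.0] * len(values)
    quantized = []
    for v in values:
        scaled = abs(v) / norm * num_levels  # lies in [0, num_levels]
        lower = math.floor(scaled)
        level = lower + (1 if rng.random() < scaled - lower else 0)
        quantized.append(math.copysign(level / num_levels * norm, v))
    return quantized


def quantize_layerwise(layers, levels_per_layer, seed=0):
    """Quantize each named layer with its own number of levels."""
    rng = random.Random(seed)
    return {
        name: quantize_layer(values, levels_per_layer[name], rng)
        for name, values in layers.items()
    }


# Hypothetical example: two layers with different quantization budgets.
layers = {"attention": [0.3, -0.4], "residual": [0.0, 0.0, 0.0]}
levels_per_layer = {"attention": 4, "residual": 2}
quantized = quantize_layerwise(layers, levels_per_layer)
```

In this sketch the per-layer level counts are fixed by hand. A scheme that adapts to layer heterogeneity over the course of training, as the abstract describes, would instead update them from layer statistics; that update rule is not given here and is left out.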