LGOct 13, 2020
Revisiting BFloat16 TrainingPedram Zamirai, Jian Zhang, Christopher R. Aberger et al.
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which is more costly than only using 16-bit FPUs for hardware design. We ask: can we train deep learning models only with 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study 16-bit-FPU training on the widely adopted BFloat16 unit. While these units conventionally use nearest rounding to cast output to 16-bit precision, we show that nearest rounding for model weight updates often cancels small updates, which degrades the convergence and model accuracy. Motivated by this, we study two simple techniques well-established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation in 16-bit-FPU training. We demonstrate that these two techniques can enable up to 7% absolute validation accuracy gain in 16-bit-FPU training. This leads to 0.1% lower to 0.2% higher validation accuracy compared to 32-bit training across seven deep learning applications.
CLFeb 29, 2020
Understanding the Downstream Instability of Word EmbeddingsMegan Leszczynski, Avner May, Jian Zhang et al.
Many industrial machine learning (ML) systems require frequent retraining to keep up-to-date with constantly changing data. This retraining exacerbates a large challenge facing ML systems today: model training is unstable, i.e., small changes in training data can cause significant changes in the model's predictions. In this paper, we work on developing a deeper understanding of this instability, with a focus on how a core building block of modern natural language processing (NLP) pipelines---pre-trained word embeddings---affects the instability of downstream NLP models. We first empirically reveal a tradeoff between stability and memory: increasing the embedding memory 2x can reduce the disagreement in predictions due to small changes in training data by 5% to 37% (relative). To theoretically explain this tradeoff, we introduce a new measure of embedding instability---the eigenspace instability measure---which we prove bounds the disagreement in downstream predictions introduced by the change in word embeddings. Practically, we show that the eigenspace instability measure can be a cost-effective way to choose embedding parameters to minimize instability without training downstream models, outperforming other embedding distance measures and performing competitively with a nearest neighbor-based measure. Finally, we demonstrate that the observed stability-memory tradeoffs extend to other types of embeddings as well, including knowledge graph and contextual word embeddings.
DCOct 9, 2019
PipeMare: Asynchronous Pipeline Parallel DNN TrainingBowen Yang, Jian Zhang, Jonathan Li et al.
Pipeline parallelism (PP) when training neural networks enables larger models to be partitioned spatially, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve the statistical efficiency of sequential training, existing PP techniques sacrifice hardware efficiency by decreasing pipeline utilization or incurring extra memory costs. In this paper, we investigate to what extent these sacrifices are necessary. We devise PipeMare, a simple yet robust training method that tolerates asynchronous updates during PP execution without sacrificing utilization or memory, which allows efficient use of fine-grained pipeline parallelism. Concretely, when tested on ResNet and Transformer networks, asynchrony enables PipeMare to use up to $2.7\times$ less memory or get $4.3\times$ higher pipeline utilization, with similar model quality, when compared to state-of-the-art synchronous PP training techniques.
LGApr 24, 2019
Low-Memory Neural Network Training: A Technical ReportNimit S. Sohoni, Christopher R. Aberger, Megan Leszczynski et al.
Memory is increasingly often the bottleneck when training neural network models. Despite this, techniques to lower the overall memory requirements of training have been less widely studied compared to the extensive literature on reducing the memory requirements of inference. In this paper we study a fundamental question: How much memory is actually needed to train a neural network? To answer this question, we profile the overall memory usage of training on two representative deep learning benchmarks -- the WideResNet model for image classification and the DynamicConv Transformer model for machine translation -- and comprehensively evaluate four standard techniques for reducing the training memory requirements: (1) imposing sparsity on the model, (2) using low precision, (3) microbatching, and (4) gradient checkpointing. We explore how each of these techniques in isolation affects both the peak memory usage of training and the quality of the end model, and explore the memory, accuracy, and computation tradeoffs incurred when combining these techniques. Using appropriate combinations of these techniques, we show that it is possible to the reduce the memory required to train a WideResNet-28-2 on CIFAR-10 by up to 60.7x with a 0.4% loss in accuracy, and reduce the memory required to train a DynamicConv model on IWSLT'14 German to English translation by up to 8.7x with a BLEU score drop of 0.15.
LGMar 9, 2018
High-Accuracy Low-Precision TrainingChristopher De Sa, Megan Leszczynski, Jian Zhang et al.
Low-precision computation is often used to lower the time and energy cost of machine learning, and recently hardware accelerators have been developed to support it. Still, it has been used primarily for inference - not training. Previous low-precision training algorithms suffered from a fundamental tradeoff: as the number of bits of precision is lowered, quantization noise is added to the model, which limits statistical accuracy. To address this issue, we describe a simple low-precision stochastic gradient descent variant called HALP. HALP converges at the same theoretical rate as full-precision algorithms despite the noise introduced by using low precision throughout execution. The key idea is to use SVRG to reduce gradient variance, and to combine this with a novel technique called bit centering to reduce quantization error. We show that on the CPU, HALP can run up to $4 \times$ faster than full-precision SVRG and can match its convergence trajectory. We implemented HALP in TensorQuant, and show that it exceeds the validation performance of plain low-precision SGD on two deep learning tasks.