LG · AI · Feb 1, 2022

Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

arXiv:2202.00441v2 · 20 citations
Originality: Incremental advance
AI Analysis

The paper addresses memory limits in training large neural networks and offers a drop-in replacement for existing pipelines. It is rated incremental because it builds on existing quantization methods.

The paper reduces the memory footprint of large neural network training by quantizing the stored gradients of pointwise activation functions to only a few bits per element, while maintaining convergence on several open benchmarks.

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which, as we show, can be significantly reduced by quantizing the gradients. We propose a systematic approach to compute the optimal quantization of the retained gradients of pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing the optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.
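
The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gelu_grad` and `fit_levels`, the choice of GELU, the grid on [-5, 5], and the Gaussian weighting of the error are all assumptions made for illustration. The dynamic program finds the k-piece constant function that minimizes the weighted squared error to the activation derivative on the grid; the forward pass then only needs to keep a small integer code per element instead of the full-precision input.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_grad(x):
    """Exact derivative of GELU(x) = x * Phi(x), i.e. Phi(x) + x * phi(x)."""
    x = np.asarray(x, dtype=np.float64)
    Phi = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    phi = np.exp(-0.5 * x * x) / sqrt(2.0 * pi)
    return Phi + x * phi

def fit_levels(grid, g, w, k):
    """Best k-piece constant fit of samples g (on an increasing grid)
    under positive weights w. Hypothetical helper for illustration.
    Returns (thresholds, levels, weighted squared error)."""
    n = len(g)
    # Prefix sums give the weighted squared error of any segment in O(1).
    W = np.concatenate(([0.0], np.cumsum(w)))
    S = np.concatenate(([0.0], np.cumsum(w * g)))
    Q = np.concatenate(([0.0], np.cumsum(w * g * g)))
    # dp[m, j]: minimal error covering the first j samples with m pieces.
    dp = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=np.int64)
    dp[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            i = np.arange(m - 1, j)            # candidate starts of the last piece
            ww, ss = W[j] - W[i], S[j] - S[i]
            cand = dp[m - 1, i] + (Q[j] - Q[i]) - ss * ss / ww
            best = int(np.argmin(cand))
            dp[m, j], back[m, j] = cand[best], i[best]
    # Walk the back-pointers to recover the piece boundaries.
    bounds = [n]
    for m in range(k, 0, -1):
        bounds.append(int(back[m, bounds[-1]]))
    bounds.reverse()                           # [0, b_1, ..., b_{k-1}, n]
    # Each level is the weighted mean of the derivative over its piece.
    levels = np.array([(S[b] - S[a]) / (W[b] - W[a])
                       for a, b in zip(bounds[:-1], bounds[1:])])
    # Thresholds sit halfway between the neighbouring grid points.
    thresholds = np.array([0.5 * (grid[b - 1] + grid[b]) for b in bounds[1:-1]])
    return thresholds, levels, float(dp[k, n])

# Assumed setup: pre-activations roughly N(0, 1), so weight errors by a Gaussian.
grid = np.linspace(-5.0, 5.0, 401)
w = np.exp(-0.5 * grid ** 2)
g = gelu_grad(grid)
thresholds, levels, err = fit_levels(grid, g, w, k=4)   # 4 levels = 2 bits

# Forward pass: keep only a 2-bit code per element (held in uint8 here).
x = np.random.default_rng(0).standard_normal(8)
codes = np.searchsorted(thresholds, x).astype(np.uint8)

# Backward pass: replace the exact derivative by the level for each code.
grad_out = np.ones_like(x)
grad_in = grad_out * levels[codes]
```

In this sketch the codes occupy one byte each for simplicity; an actual memory saving at 2 bits per element would require packing four codes into each byte, which is omitted here.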

Code Implementations · 2 repos
