Inverted Activations: Reducing Memory Footprint in Neural Network Training
This work addresses memory efficiency in training large models such as transformers, but the contribution is incremental because it optimizes an existing bottleneck.
The paper tackles the memory footprint problem in neural network training by proposing a method that saves output tensors instead of input tensors in pointwise nonlinearity layers, reducing memory usage without affecting accuracy or performance.
The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. To enable this approach, we utilize the inverse function of the nonlinearity during the backward pass. As the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations using simpler functions. Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance. Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications.
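The save-the-output idea can be illustrated with a small sketch. It is not the paper's implementation: it is plain Python on lists of floats rather than a PyTorch layer, and it uses softplus as a stand-in nonlinearity because softplus has a closed-form inverse, whereas the abstract says most nonlinearities need an approximate inverse. The names `InvertedSoftplus`, `softplus_inverse`, and `saved_output` are hypothetical and chosen only for this illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_inverse(y):
    # Closed-form inverse for y > 0: log(exp(y) - 1) = y + log(1 - exp(-y)).
    return y + math.log(-math.expm1(-y))


def sigmoid(x):
    # Derivative of softplus, written to avoid overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class InvertedSoftplus:
    """Hypothetical layer that keeps only its output for the backward pass."""

    def forward(self, xs):
        ys = [softplus(x) for x in xs]
        # Assumption for this sketch: the next layer would save the same
        # values as its input, so the input xs is deliberately not stored.
        self.saved_output = ys
        return ys

    def backward(self, grad_out):
        grads = []
        for y, g in zip(self.saved_output, grad_out):
            x = softplus_inverse(y)       # recover the input from the output
            grads.append(g * sigmoid(x))  # chain rule: dL/dx = dL/dy * f'(x)
        return grads


if __name__ == "__main__":
    xs = [-3.0, -0.5, 0.0, 1.2, 4.0]
    layer = InvertedSoftplus()
    layer.forward(xs)
    grads = layer.backward([1.0] * len(xs))
    # Compare against the derivative evaluated directly on the true inputs.
    max_err = max(abs(g - sigmoid(x)) for g, x in zip(grads, xs))
    print("max gradient error:", max_err)
```

In PyTorch, the same pattern would presumably live in a custom `torch.autograd.Function` whose `forward` calls `ctx.save_for_backward` on the output tensor instead of the input. With an approximate inverse, the gradient error would depend on the quality of the approximation rather than on floating-point rounding alone.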