Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
This work addresses a critical bottleneck for speech recognition researchers and practitioners: it enables larger batch sizes and longer sequences on limited GPU memory. The contribution is incremental, since it optimizes an existing training process.
The paper tackles the high memory consumption of neural transducer training for automatic speech recognition by proposing a sample-wise computation method. The method computes the transducer loss and gradients for a batch size of 1024 and 40-second audio using only 6 GB of memory, at a speed competitive with the default batched computation.
The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds using only 6 GB of memory.
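To make the memory argument concrete, the following is a minimal sketch of the arithmetic, not the benchmark setup of this work. It assumes the usual transducer layout in which the joint network output for a batch is a padded tensor of shape [B, T_max, U_max + 1, V], and that a sample-wise method keeps only one sample's [T_i, U_i + 1, V] tensor alive at a time. The function names, the example sequence lengths, and the vocabulary size are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each entry is (T_i, U_i): number of encoder frames and number of labels
# for one sample. These values are illustrative, not taken from the paper.
Shape = Tuple[int, int]


def joint_elements_batched(shapes: List[Shape], vocab_size: int) -> int:
    """Element count of a padded joint tensor [B, T_max, U_max + 1, V]."""
    t_max = max(t for t, _ in shapes)
    u_max = max(u for _, u in shapes)
    return len(shapes) * t_max * (u_max + 1) * vocab_size


def joint_elements_sample_wise(shapes: List[Shape], vocab_size: int) -> int:
    """Peak element count if only one sample's [T_i, U_i + 1, V] is live."""
    return max(t * (u + 1) * vocab_size for t, u in shapes)


def to_mib(elements: int, bytes_per_element: int = 4) -> float:
    """Convert an element count to MiB, assuming float32 by default."""
    return elements * bytes_per_element / (1024 ** 2)


# Hypothetical batch of three samples with a hypothetical vocabulary of 32.
shapes = [(100, 10), (50, 20), (80, 5)]
batched = joint_elements_batched(shapes, vocab_size=32)
sample_wise = joint_elements_sample_wise(shapes, vocab_size=32)

print(f"batched:     {batched} elements ({to_mib(batched):.3f} MiB)")
print(f"sample-wise: {sample_wise} elements ({to_mib(sample_wise):.3f} MiB)")
```

Under these assumptions the batched count grows with the batch size and with the padding to the longest sequence, whereas the sample-wise peak depends only on the largest single sample. The sketch ignores activations kept for backpropagation elsewhere in the network, as well as any parallelism across samples.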