Label-Looping: Highly Efficient Decoding for Transducers
This work addresses a computational bottleneck in Transducer-based speech recognition, the greedy decoding loop, and gives researchers and practitioners faster batched decoding.
The paper tackles the inefficiency of greedy decoding in Transducer-based speech recognition by introducing a label-looping algorithm that swaps the loops over frames and labels and stores partial hypotheses in CUDA tensors so they can be manipulated in parallel, achieving up to 2.0X faster decoding than conventional batched decoding at batch size 32.
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We restructure the standard nested-loop design for RNN-T decoding by swapping the loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames, searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special structure using CUDA tensors, which supports parallelized hypothesis manipulation. Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding when using batch size 32. It can be combined with other compiler or GPU call-related techniques to achieve further speedups. Our algorithm is general-purpose and works with both conventional Transducers and Token-and-Duration Transducers. We open-source our implementation to benefit the research community.
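The loop swap described in the abstract can be sketched in plain Python for a single utterance. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_joint`, `frame_looping_decode`, and `label_looping_decode` are hypothetical names, the joint is a toy stand-in for the encoder, prediction network, and joint network, and the batching and CUDA-tensor hypothesis structure are omitted entirely.

```python
BLANK = 0  # assumed blank index for this sketch


def toy_joint(frame, last_label):
    """Hypothetical stand-in for the joint network's argmax.

    Emits the frame's value as a label unless it equals the last
    emitted label, in which case it emits blank.
    """
    return frame if frame != last_label else BLANK


def frame_looping_decode(frames, joint, max_symbols=10):
    """Conventional order: outer loop over frames, inner loop over labels."""
    hyp, last = [], BLANK
    for frame in frames:
        for _ in range(max_symbols):
            label = joint(frame, last)
            if label == BLANK:
                break  # blank: move on to the next frame
            hyp.append(label)
            last = label
    return hyp


def label_looping_decode(frames, joint, max_symbols=10):
    """Swapped order: outer loop over labels, inner loop over frames."""
    hyp, last = [], BLANK
    t, emitted_at_t = 0, 0
    while t < len(frames):  # outer loop: one emitted label per iteration
        label = joint(frames[t], last)
        # Inner loop: advance frames until a non-blank label is found
        # (or the per-frame symbol limit forces a move to the next frame).
        while label == BLANK or emitted_at_t >= max_symbols:
            t += 1
            emitted_at_t = 0
            if t == len(frames):
                return hyp
            label = joint(frames[t], last)
        hyp.append(label)
        last = label
        emitted_at_t += 1
    return hyp


frames = [0, 3, 3, 0, 5, 5, 2]  # toy "encoder outputs"
print(frame_looping_decode(frames, toy_joint))  # → [3, 5, 2]
print(label_looping_decode(frames, toy_joint))  # → [3, 5, 2]
```

Both functions return the same hypothesis; only the loop order differs. In the label-looping version the last emitted label stays fixed throughout the inner frame search, which is the property the swapped loop order provides; how the paper exploits it for batched decoding is not shown here.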