A GPU-based WFST Decoder with Exact Lattice Generation
This work delivers a significant speedup for WFST-based speech recognition, chiefly benefiting researchers and practitioners who use the Kaldi toolkit. The contribution is incremental in that it extends existing decoding methods with GPU optimizations.
The authors accelerate weighted finite-state transducer (WFST) decoding for speech recognition by implementing a GPU-based decoder with exact lattice generation. The decoder produces identical 1-best results and lattice quality while running 3 to 15 times faster than the single-process Kaldi decoder, and it reaches a 46-fold speedup with sequence parallelism and multi-process service (MPS).
We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient scheduling of token passing among GPU threads. We also redesign the exact lattice generation and lattice pruning algorithms for better utilization of the GPU. Experiments on the Switchboard corpus show that the proposed method achieves identical 1-best results and lattice quality in recognition and confidence measure tasks, while running 3 to 15 times faster than the single-process Kaldi decoder. These results are reported on different GPU architectures. Additionally, we obtain a 46-fold speedup with sequence parallelism and multi-process service (MPS) on the GPU.
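To make the idea of token recombination as an atomic operation concrete, the following is a minimal Python sketch and not the decoder's actual code. The abstract does not say how the atomic operation is built, so the packing scheme here is an assumption: each candidate token's cost and index are packed into one 64-bit integer, so that a single atomic minimum per destination state keeps the lowest-cost token. All names (`float_to_ordered_u32`, `pack`, `recombine`, `EMPTY`) are hypothetical, and the sequential loop stands in for GPU threads that would each issue one 64-bit atomic min.

```python
import struct

EMPTY = (1 << 64) - 1  # sentinel: no token has reached this state yet


def float_to_ordered_u32(x: float) -> int:
    """Map a float32 to a uint32 whose integer order matches the float order."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000


def pack(cost: float, token_id: int) -> int:
    """Put the ordered cost in the high 32 bits and the token index in the low 32."""
    return (float_to_ordered_u32(cost) << 32) | (token_id & 0xFFFFFFFF)


def unpack_token_id(packed: int) -> int:
    """Recover the token index from a packed value."""
    return packed & 0xFFFFFFFF


def recombine(num_states: int, candidates):
    """Keep the lowest-cost token per destination state.

    candidates: iterable of (state, cost, token_id) tuples.
    Assumption: on a GPU, each loop iteration would be one thread
    performing a single 64-bit atomic min on best[state].
    """
    best = [EMPTY] * num_states
    for state, cost, token_id in candidates:
        best[state] = min(best[state], pack(cost, token_id))
    return best


if __name__ == "__main__":
    arcs = [(0, 2.5, 10), (0, 1.5, 11), (1, -0.5, 12), (1, 0.25, 13)]
    for state, packed in enumerate(recombine(2, arcs)):
        print(state, unpack_token_id(packed))
```

Because the cost occupies the high bits, comparing packed values compares costs first and falls back to the token index only on ties, so the outcome does not depend on the order in which threads arrive. That order independence is what would allow the comparison and the update to collapse into one atomic instruction in this sketch.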