FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
FlexCTC gives speech recognition researchers and practitioners a faster, more efficient alternative to existing decoders. The contribution is incremental, however, because it builds on established CTC and beam search methods.
The paper addresses the slow, CPU-bound nature of standard beam search for CTC-based speech recognition. It introduces FlexCTC, a fully GPU-based toolkit that eliminates CPU-GPU synchronization and supports advanced contextualization. The result is fast, efficient decoding suitable for both research and production.
While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present FlexCTC, a novel open-source toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation that eliminates CPU-GPU synchronization and minimizes kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making the toolkit suitable for both research and production use.
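To make the gap between greedy and beam decoding concrete, here is a minimal sketch of textbook CTC prefix beam search with an optional score hook of the kind used for language model fusion or phrase boosting. It is a sequential, pure-Python reference for illustration only, not FlexCTC's batched GPU implementation or its API. The names `ctc_greedy`, `ctc_prefix_beam_search`, `lm_score`, and `lm_weight` are hypothetical, and folding the bonus directly into the prefix score is a simplifying assumption.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_greedy(log_probs, blank=0):
    """Pick the best label per frame, then collapse repeats and drop blanks."""
    out, prev = [], blank
    for frame in log_probs:
        v = max(range(len(frame)), key=frame.__getitem__)
        if v != blank and v != prev:
            out.append(v)
        prev = v
    return tuple(out)


def ctc_prefix_beam_search(log_probs, beam_size=4, blank=0,
                           lm_score=None, lm_weight=0.0):
    """log_probs: T frames, each a list of V log-probabilities.

    lm_score(prefix, label) is a hypothetical hook returning a log-domain
    bonus for appending `label` to `prefix` (LM fusion or phrase boost).
    """
    # prefix -> (log-prob of paths ending in blank, ... ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            p_total = logsumexp(p_b, p_nb)
            for v, lp in enumerate(frame):
                if v == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_total + lp), nb)
                    continue
                new_prefix = prefix + (v,)
                bonus = lm_weight * lm_score(prefix, v) if lm_score else 0.0
                b, nb = nxt[new_prefix]
                if prefix and prefix[-1] == v:
                    # Same label again: it extends the prefix only after a blank...
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp + bonus))
                    # ...otherwise it is a repeat that collapses into the prefix.
                    b2, nb2 = nxt[prefix]
                    nxt[prefix] = (b2, logsumexp(nb2, p_nb + lp))
                else:
                    nxt[new_prefix] = (b, logsumexp(nb, p_total + lp + bonus))
        # Prune to the beam_size most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    return max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]


# Two identical frames over the vocabulary {0: blank, 1: "a", 2: "b"}.
frames = [[math.log(p) for p in (0.4, 0.35, 0.25)]] * 2
greedy_out = ctc_greedy(frames)                       # blank wins each frame
beam_out = ctc_prefix_beam_search(frames)             # sums all paths per prefix
boosted_out = ctc_prefix_beam_search(
    frames, lm_score=lambda prefix, v: 0.0 if v == 2 else -5.0, lm_weight=1.0
)
```

On this toy input, greedy decoding returns the empty sequence because blank is the most likely label in every frame, while beam search returns `(1,)`: the paths that collapse to "a" together carry probability 0.4025 against 0.16 for the empty sequence. With the hypothetical hook penalizing every label except 2, the result shifts to `(2,)`, which is the basic effect contextual biasing aims for. The nested Python loops over prefixes and labels are exactly the kind of sequential, CPU-bound work the abstract says standard implementations suffer from.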