Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition
This work addresses a specific bottleneck in streaming ASR, the mismatch between how transducer models are trained and how they are used at inference, and offers an incremental improvement.
The paper tackles the mismatch between training and inference in streaming transducer-based speech recognition, which deforms the likelihood and leads to suboptimal accuracy. It introduces forward variable causal compensation (FoCC), a quantification of the gap between the actual and the deformed likelihood, together with its estimator FoCCE for estimating the exact likelihood. Experiments on the LibriSpeech dataset show that FoCCE training improves the accuracy of streaming transducers.
Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.
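For orientation, the "non-streaming recursion rules" mentioned above can be illustrated with the standard transducer forward-variable recursion. The following is a minimal sketch under stated assumptions, not the FoCC or FoCCE method, whose definitions are not given in this abstract. The function name `transducer_log_likelihood` and the array names `log_blank` and `log_label` are hypothetical; the sketch assumes the per-node log-probabilities of the blank symbol and of the next target label have already been computed by the model.

```python
import numpy as np

def transducer_log_likelihood(log_blank, log_label):
    """Standard (non-streaming) transducer forward recursion in log space.

    log_blank[t, u]: log-probability of emitting blank at lattice node (t, u),
                     shape (T, U + 1).
    log_label[t, u]: log-probability of emitting the next target label at
                     node (t, u), shape (T, U).
    Both arrays are illustrative inputs, assumed to come from the model.
    """
    T, U1 = log_blank.shape
    alpha = np.full((T, U1), -np.inf)  # forward variables, log domain
    alpha[0, 0] = 0.0                  # log(1): start of the lattice
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            # Arrive from the previous frame by emitting blank.
            stay = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
            # Arrive from the previous label position by emitting a label.
            emit = alpha[t, u - 1] + log_label[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, emit)
    # A final blank at the last node terminates every alignment.
    return alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1]

# Toy lattice: T = 2 frames, U = 1 label, every probability set to 0.5.
# There are two alignments, each with probability 0.5 ** 3.
toy_blank = np.log(np.full((2, 2), 0.5))
toy_label = np.log(np.full((2, 1), 0.5))
toy_ll = transducer_log_likelihood(toy_blank, toy_label)
print(np.exp(toy_ll))
```

Conventional training maximizes the quantity returned by this recursion. The abstract's claim is that applying the same recursion to a streaming model yields a deformed likelihood rather than the actual one, and FoCC quantifies that gap.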