Neural-FST Class Language Model for End-to-End Speech Recognition
This work addresses the need for compact, accurate language models in speech recognition, particularly for on-device use. The contribution appears incremental, since it builds on existing components: neural network language models and finite state transducers.
The paper addresses language modeling for end-to-end speech recognition by proposing the Neural-FST Class Language Model (NFCLM), which combines neural network language models with finite state transducers. The authors report a 15.8% relative reduction in Word Error Rate compared with a neural network language model alone, and a model 12 times more compact than the combination of a neural network language model and FST shallow fusion.
We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework. Our method uses a background NNLM, which models generic background text, together with a collection of domain-specific entities modeled as individual FSTs. Each output token is generated by a mixture of these components; the mixture weights are estimated with a separately trained neural decider. We show that NFCLM outperforms the NNLM by 15.8% relative in Word Error Rate. NFCLM achieves performance similar to traditional NNLM and FST shallow fusion while being less prone to overbiasing and 12 times more compact, making it more suitable for on-device usage.
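The token-level mixture described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_token_distribution`, the toy vocabularies, and the fixed weights are all hypothetical. In particular, the weights here are hard-coded, whereas the abstract says they come from a separately trained neural decider, and each FST is replaced by a plain next-token distribution.

```python
from typing import Dict, List


def mix_token_distribution(
    component_probs: List[Dict[str, float]],
    weights: List[float],
) -> Dict[str, float]:
    """Combine per-component next-token distributions into one mixture.

    component_probs[k] maps token -> P_k(token | history) for component k
    (hypothetically, k = 0 is the background NNLM and k >= 1 are entity FSTs).
    weights[k] is the mixture weight for component k; weights must sum to 1.
    """
    if len(component_probs) != len(weights):
        raise ValueError("need exactly one weight per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")

    mixed: Dict[str, float] = {}
    for probs, weight in zip(component_probs, weights):
        for token, p in probs.items():
            # P(token) = sum_k weight_k * P_k(token)
            mixed[token] = mixed.get(token, 0.0) + weight * p
    return mixed


# Toy stand-ins (assumed values, not from the paper).
background_nnlm = {"play": 0.5, "call": 0.4, "alice": 0.1}
contacts_fst = {"alice": 0.6, "bob": 0.4}

# Hypothetical decider output: 70% background, 30% contact-name entity.
decider_weights = [0.7, 0.3]

mixed = mix_token_distribution([background_nnlm, contacts_fst], decider_weights)
```

In this toy setup the entity component raises the probability of "alice" above what the background model alone assigns, and gives "bob" nonzero mass even though the background model never produces it, while the result remains a valid probability distribution.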