External Language Model Integration for Factorized Neural Transducers
This work targets accuracy improvements in speech recognition systems based on factorized neural transducers (FNT). It is primarily of interest to the automatic speech recognition community and is incremental, building on existing FNT frameworks.
The paper tackles the problem of improving factorized neural transducers (FNT) by integrating external language models, showing that linear interpolation with predictor output yields better results than shallow fusion. It reports average gains of 18% word error rate reduction (WERR) with lexical adaptation and up to 60% WERR in entity-rich scenarios using combined class-based n-gram and neural LMs.
We propose an adaptation method for factorized neural transducers (FNT) with external language models. We demonstrate that both neural and n-gram external LMs add significantly more value when linearly interpolated with the predictor output than when combined through shallow fusion, thus confirming that FNT forces the predictor to act like a regular language model. Further, we propose a method to integrate class-based n-gram language models into the FNT framework, resulting in accuracy gains similar to those of a hybrid setup. We show average gains of 18% WERR with lexical adaptation across various scenarios and additive gains of up to 60% WERR in one entity-rich scenario through a combination of class-based n-gram and neural LMs.
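
The difference between the two combination schemes named above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the weight `lam`, and the toy three-token vocabulary are assumptions made for illustration, and the sketch assumes that linear interpolation mixes the predictor and external LM distributions in probability space, while shallow fusion adds a weighted external LM log-probability to the model score.

```python
import math


def log_softmax(logits):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def linear_interpolation(pred_logp, ext_logp, lam):
    """Mix two distributions in probability space.

    Returns log((1 - lam) * P_pred + lam * P_ext) per token.
    `lam` is a hypothetical interpolation weight in [0, 1].
    """
    return [
        math.log((1.0 - lam) * math.exp(p) + lam * math.exp(e))
        for p, e in zip(pred_logp, ext_logp)
    ]


def shallow_fusion(model_logp, ext_logp, lam):
    """Log-linear combination: add a weighted external LM score.

    The result is a score, not a normalized log-probability.
    """
    return [m + lam * e for m, e in zip(model_logp, ext_logp)]


# Toy three-token vocabulary; all numbers are illustrative only.
pred_logp = log_softmax([2.0, 0.5, -1.0])  # stands in for the FNT predictor output
ext_logp = log_softmax([0.0, 3.0, 0.0])    # stands in for an external LM

mixed = linear_interpolation(pred_logp, ext_logp, lam=0.3)
fused = shallow_fusion(pred_logp, ext_logp, lam=0.3)

# The interpolated output is still a valid distribution; the fused scores are not.
mixed_mass = sum(math.exp(x) for x in mixed)
fused_mass = sum(math.exp(x) for x in fused)
```

Under these assumptions, the interpolated output remains a proper distribution over the vocabulary, which is only meaningful if the predictor output itself behaves like a language model distribution. The fused scores carry no such guarantee and are usable only for ranking hypotheses.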