Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI
This work improves recognition accuracy in automatic speech recognition for applications that require high accuracy. The contribution is incremental: it adapts an existing training criterion to new frameworks.
The paper integrates Lattice-Free Maximum Mutual Information (LF-MMI) into end-to-end speech recognition frameworks, achieving a competitive character error rate (CER) of 4.1%/4.4% on the Aishell-1 dev/test sets and significant error reductions on the Aishell-2 and Librispeech datasets.
Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), one of the discriminative training criteria that show superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages. The proposed approach shows its effectiveness on two of the most widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Experiments suggest that introducing the LF-MMI criterion consistently leads to significant performance improvements across various datasets and different E2E ASR frameworks. The best of our models achieves a competitive CER of 4.1% / 4.4% on the Aishell-1 dev/test sets; we also achieve significant error reductions on the Aishell-2 and Librispeech datasets over strong baselines.
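
As an informal illustration of what a maximum mutual information criterion measures, the sketch below computes the log-posterior of a reference transcript against a small set of competing hypotheses. It is a toy under stated assumptions and not the method of this paper: the function names and the scores are hypothetical, and the denominator is obtained by enumerating a handful of hypotheses, whereas LF-MMI computes it over a denominator graph without enumerating hypotheses or generating lattices.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mmi_objective(ref_log_score, competitor_log_scores):
    """Toy MMI objective for one utterance (hypothetical helper).

    ref_log_score: joint log-score log p(O, W_ref) of the reference.
    competitor_log_scores: joint log-scores log p(O, W) of competing
        hypotheses (an enumerated stand-in for a denominator graph).
    Returns log p(W_ref | O), which is always <= 0.
    """
    # The denominator sums over all hypotheses, including the reference.
    denominator = log_sum_exp([ref_log_score] + list(competitor_log_scores))
    return ref_log_score - denominator


# Hypothetical scores, chosen only for illustration.
ref = -10.0
competitors = [-12.0, -11.0]
objective = mmi_objective(ref, competitors)
print(f"log posterior of reference: {objective:.4f}")
```

Maximizing this quantity raises the reference score relative to its competitors, which is the sense in which the criterion is discriminative.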