LGCV Oct 17, 2022

A Treatise On FST Lattice Based MMI Training

arXiv:2210.08918v1
h-index: 16
Originality: Incremental advance
AI Analysis

This work targets the accuracy of sequence-level training for speech recognition acoustic models, evaluated on dictation and assistant tasks. It is incremental in that it refines the existing FST lattice-based MMI framework rather than proposing a new one.

The paper examines the implicit modeling decisions in finite state transducer (FST) lattice-based maximum mutual information (MMI) training for speech recognition and shows that determinizing denominator lattices on the fly guarantees discrimination at the hypothesis level. On Mandarin and English datasets, it achieves a relative word error rate reduction of between 2.3% and 4.6% over the standard FST lattice-based approach.
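For readers unfamiliar with the metric, relative WER reduction compares the new error rate against the baseline rather than reporting an absolute difference. A minimal sketch follows; the function name and the input values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs for illustration only; these are not results from the paper.
werr = relative_wer_reduction(baseline_wer=10.0, new_wer=9.6)
print(f"{werr:.1f}% relative WER reduction")
```

With these made-up inputs, an absolute drop of 0.4 WER points corresponds to a 4% relative reduction, which is the scale of improvement the summary reports.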

Maximum mutual information (MMI) has become one of the two de facto methods for sequence-level training of speech recognition acoustic models. This paper aims to isolate, identify and bring forward the implicit modelling decisions induced by the design and implementation of the standard finite state transducer (FST) lattice-based MMI training framework. In particular, it investigates whether a preselected numerator alignment needs to be maintained and highlights the importance of determinizing FST denominator lattices on the fly. On-the-fly FST lattice determinization is shown mathematically to guarantee discrimination at the hypothesis level, and its efficacy is shown empirically by training deep CNN models on an 18K-hour Mandarin dataset and a 2.8K-hour English dataset. On assistant and dictation tasks, the approach achieves a relative WER reduction (WERR) of between 2.3% and 4.6% over the standard FST lattice-based approach.
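The role of determinization can be illustrated with a toy example. A lattice may contain several paths (for instance, different alignments) that carry the same word sequence; weighted determinization leaves exactly one path per word sequence, and over the log semiring that path's weight is the log-sum of the merged paths. The sketch below mimics this effect on a flat list of paths instead of a real FST. It is not the paper's implementation: the choice of the log semiring, the helper names and the scores are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of log-scores."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def merge_paths_by_hypothesis(paths):
    """Collapse paths that share a word sequence, log-adding their scores.

    Assumption: this imitates determinizing an acyclic lattice over the
    log semiring, where each distinct word sequence keeps a single path.
    """
    grouped = defaultdict(list)
    for words, log_score in paths:
        grouped[tuple(words)].append(log_score)
    return {words: logsumexp(scores) for words, scores in grouped.items()}


def log_posterior(reference, hypotheses):
    """Log-posterior of the reference word sequence among the merged hypotheses."""
    denominator = logsumexp(list(hypotheses.values()))
    return hypotheses[tuple(reference)] - denominator


# Toy lattice paths as (word sequence, log-score); all values are hypothetical.
paths = [
    (["call", "mom"], -1.0),  # one alignment of the reference words
    (["call", "mom"], -1.5),  # a second alignment of the same words
    (["call", "tom"], -2.0),  # a competing hypothesis
]

hyps = merge_paths_by_hypothesis(paths)
ref_log_post = log_posterior(["call", "mom"], hyps)
print(len(hyps), "distinct hypotheses; reference posterior =", math.exp(ref_log_post))
```

After merging, the competitors are distinct word sequences rather than individual paths, so the denominator discriminates between hypotheses instead of between alignments of the same hypothesis.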

