AS · AI · CL · LG · SD · Feb 16, 2023

JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

arXiv:2302.08583v1 · 13 citations · h-index: 69
Originality: Highly original
AI Analysis

JEIT addresses rare-word speech recognition with a novel training approach that needs no separate adaptation step.

The paper tackles rare-word speech recognition by proposing JEIT, a joint training method that injects large-scale unpaired text into the end-to-end model's internal language model during end-to-end training. With 100B unpaired sentences, JEIT and its CJJT extension improve rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method that injects large-scale unpaired text into the ILM during E2E training, improving rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of the E2E and ILM losses. During JEIT, the ILM absorbs knowledge from unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT does not require a separate adaptation step and avoids the need for Kullback-Leibler divergence regularization of the ILM. We also show that the modular hybrid autoregressive transducer (MHAT) performs better than HAT in the JEIT framework, and is much more robust than HAT during ILM adaptation. To push the limit of unpaired text injection, we further propose a combined JEIT and JOIST training (CJJT) that benefits from modality matching, encoder text injection, and ILM training. Both JEIT and CJJT can foster more effective LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
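
The weighted-sum objective described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `ilm_weight` parameter, and the choice to apply the weight only to the ILM term are assumptions made for illustration, and the E2E loss is taken as an already-computed number rather than derived from audio.

```python
import math


def cross_entropy(log_probs, targets):
    """Mean negative log-likelihood of the target token ids.

    log_probs: one list of vocabulary log-probabilities per text position.
    targets:   the token id expected at each position.
    """
    total = sum(step[t] for step, t in zip(log_probs, targets))
    return -total / len(targets)


def joint_loss(e2e_loss, ilm_log_probs, text_targets, ilm_weight):
    """Hypothetical JEIT-style objective: E2E loss plus weighted ILM loss.

    e2e_loss comes from audio-transcript pairs; the ILM cross-entropy
    comes from unpaired text. ilm_weight is an assumed hyperparameter.
    """
    ilm_loss = cross_entropy(ilm_log_probs, text_targets)
    return e2e_loss + ilm_weight * ilm_loss


# Toy example: an ILM that is uniform over a 4-token vocabulary.
uniform = [math.log(0.25)] * 4
loss = joint_loss(
    e2e_loss=2.0,
    ilm_log_probs=[uniform, uniform, uniform],
    text_targets=[0, 3, 1],
    ilm_weight=0.5,
)
```

In a real system both terms would be differentiable and share the ILM parameters, so a single optimizer step updates the model on paired and unpaired data together, which is why no separate adaptation stage is needed.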
