CL, Oct 27, 2020

Multitask Training with Text Data for End-to-End Speech Recognition

arXiv:2010.14318v2, 31 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enhancing speech recognition accuracy for real-world applications by incorporating language information, though it is incremental as it builds on existing attention-based models.

The paper tackles the problem of improving end-to-end speech recognition by proposing a multitask training method that regularizes the decoder using both audio-text and text-only data, resulting in an 11% relative performance improvement over the baseline on the 100-hour LibriSpeech subset without needing an additional language model.
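The "relative improvement" figure quoted above is a ratio to the baseline error, not an absolute difference in percentage points. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction of the baseline word error rate that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), for illustration only.
baseline_wer = 10.0
multitask_wer = 8.9

# An absolute drop of 1.1 points on a 10.0 baseline is an 11% relative gain.
print(f"{relative_improvement(baseline_wer, multitask_wer):.1%}")
```

With these made-up numbers, the absolute reduction is 1.1 points while the relative reduction is 11%, which is the kind of figure the summary reports.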

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, leads to an 11% relative performance improvement over the baseline and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of different types of errors and sample output sentences demonstrate that the proposed method can incorporate language level information, suggesting its effectiveness in real-world applications.
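The abstract does not say how the two objectives are combined, so the following is only a rough sketch of one common way to set up multitask training: a weighted sum of a cross-entropy loss on audio-text pairs and a cross-entropy loss on text-only data. The function names, the `text_weight` parameter, and the toy probabilities are all assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import Sequence


def cross_entropy(step_probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target token ids.

    step_probs[i] is the decoder's output distribution at step i.
    """
    if len(step_probs) != len(targets) or not targets:
        raise ValueError("need one distribution per target token")
    total = -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))
    return total / len(targets)


def multitask_loss(
    asr_probs: Sequence[Sequence[float]],
    asr_targets: Sequence[int],
    text_probs: Sequence[Sequence[float]],
    text_targets: Sequence[int],
    text_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Assumed combination: ASR loss plus a weighted text-only loss."""
    asr_loss = cross_entropy(asr_probs, asr_targets)      # decoder conditioned on audio
    text_loss = cross_entropy(text_probs, text_targets)   # decoder on text-only data
    return asr_loss + text_weight * text_loss


# Toy distributions over a 2-token vocabulary, for illustration only.
loss = multitask_loss([[0.5, 0.5]], [0], [[0.5, 0.5]], [1], text_weight=0.5)
print(loss)
```

In a sketch like this, the text-only term acts as a regularizer on the decoder; how the decoder is actually fed text-only examples is a detail the abstract leaves open.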

