AS | LG | SD | SP | Oct 27, 2022

Contextual-Utterance Training for Automatic Speech Recognition

arXiv:2210.16238v1 | 1 citations | h-index: 26
Originality: Incremental advance
AI Analysis

The paper addresses accuracy and latency issues in streaming ASR systems, offering incremental improvements for real-time speech processing applications.

The paper aims to improve streaming automatic speech recognition (ASR) by proposing contextual-utterance training techniques that use past and future utterances for implicit adaptation, reducing word error rate (WER) by more than 6% relative and average last-token emission latency by more than 40 ms.

Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER). In this paper, we first propose a contextual-utterance training technique that uses the previous and future contextual utterances to adapt implicitly to the speaker, topic and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, into a student, which can see only the current and past contextual utterances. Experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
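
The dual-mode idea in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the function names (build_context, dual_mode_loss), the use of a KL divergence as the distillation term, the equal weighting of the two transducer losses, and the distill_weight parameter are all hypothetical, and the transducer losses themselves are taken as precomputed numbers rather than implemented.

```python
import math


def build_context(past, current, future, mode):
    """Concatenate utterance frames along time (hypothetical helper).

    'teacher' mode sees past + current + future frames;
    'student' mode sees only past + current frames.
    Returns the frames and the (start, end) span of the current utterance.
    """
    if mode == "teacher":
        frames = past + current + future
    elif mode == "student":
        frames = past + current
    else:
        raise ValueError(f"unknown mode: {mode}")
    start = len(past)
    return frames, (start, start + len(current))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_mode_loss(student_loss, teacher_loss,
                   teacher_logits, student_logits, distill_weight=1.0):
    """Assumed combination: both transducer losses plus a distillation term.

    The distillation term is the mean KL(teacher || student) over positions.
    In a real system the teacher posteriors would be treated as constants
    (no gradient), since both modes share the same weights ("in-place").
    """
    kls = [kl_divergence(softmax(t), softmax(s))
           for t, s in zip(teacher_logits, student_logits)]
    distill = sum(kls) / len(kls)
    return student_loss + teacher_loss + distill_weight * distill


if __name__ == "__main__":
    past, current, future = [0.1, 0.2], [0.3], [0.4, 0.5]
    print(build_context(past, current, future, "teacher"))
    print(build_context(past, current, future, "student"))
    print(dual_mode_loss(1.0, 2.0, [[2.0, 0.0]], [[0.0, 2.0]]))
```

In this sketch the student and teacher would be the same network run twice with different context, so the distillation term pulls the streaming (student) outputs toward the full-context (teacher) outputs without a separately trained teacher model.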
