SD, CL, AS | Nov 17, 2022

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

arXiv:2211.09412v1 | 17 citations | h-index: 57
Originality: Incremental advance
AI Analysis

This work addresses the practical challenge of improving speech recognition accuracy for long-form audio. It is an incremental advance that builds on existing neural transducer methods.

The paper tackles long-form speech recognition by proposing LongFNT, a factorized neural transducer architecture that incorporates historical context. It achieves 19% and 12% relative word error rate reductions on the LibriSpeech and GigaSpeech corpora, respectively.

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances and do not consider long-form speech with useful historical information, even though the long-form setting is more practical in real scenarios. Simply attending to a longer transcription history in a vanilla neural transducer model yields little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model: the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder, RoBERTa, to further boost performance. Moreover, we propose the LongFNT architecture, which extends the original speech input with long-form speech and achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora, with 19% and 12% relative word error rate (WER) reduction, respectively.
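The "relative WER reduction" figures quoted above are ratios against a baseline, not absolute percentage-point drops. The short sketch below shows how such a figure is conventionally computed. The function name and the WER values are hypothetical, chosen only for illustration; they are not numbers reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline WER of 5.00% improved to 4.05%.
reduction = relative_wer_reduction(5.00, 4.05)
print(f"{reduction:.0%}")  # prints "19%"
```

Under these assumed numbers, a 0.95-point absolute drop corresponds to a 19% relative reduction, which is why relative figures can look large even when the absolute change is under one point.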

