CL · Nov 2, 2020

DNN-Based Semantic Model for Rescoring N-best Speech Recognition List

arXiv:2011.00975v1 · 1 citation
AI Analysis

This work addresses ASR robustness for noisy environments, offering an incremental improvement by integrating semantic modeling into rescoring.

The paper tackles the increase in word error rate (WER) that automatic speech recognition (ASR) suffers under noisy conditions. It proposes a DNN-based rescoring model that uses semantic features to select better hypotheses from an N-best list, and reports significant WER improvements over baseline systems without rescoring in noisy scenarios.
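Since WER is the metric every result here is stated in, a minimal sketch of how it is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring setup is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is plain
    whitespace splitting, which real scoring tools usually refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.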

The word error rate (WER) of an automatic speech recognition (ASR) system increases when there is a mismatch between the training and testing conditions, for example because of noise. In this case, the acoustic information can be less reliable. This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features. We propose to do this by rescoring the ASR N-best hypotheses list. To achieve this, we train a deep neural network (DNN). Our DNN rescoring model aims to select hypotheses that have better semantic consistency and therefore lower WER. We investigate two types of representations as part of the input features to our DNN model: static word embeddings (from word2vec) and dynamic contextual embeddings (from BERT). Acoustic and linguistic features are also included. We perform experiments on the publicly available TED-LIUM dataset mixed with real noise. The proposed rescoring approaches significantly improve the WER over the ASR system without rescoring models, in two noisy conditions and with both n-gram and RNNLM language models.
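The general idea of N-best rescoring can be sketched as follows. This is not the paper's model: the paper trains a DNN on word2vec or BERT representations together with acoustic and linguistic features, whereas this sketch swaps in a hand-weighted linear combination and a toy coherence measure (mean pairwise cosine similarity of word vectors). The names `Hypothesis`, `TOY_EMBEDDINGS`, `semantic_coherence`, and `rescore`, as well as every score and vector, are made up for illustration.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Hypothesis:
    text: str
    acoustic_logprob: float  # score from the acoustic model (made up here)
    lm_logprob: float        # score from the language model (made up here)


# Toy 2-D word vectors standing in for word2vec/BERT embeddings (assumption).
TOY_EMBEDDINGS = {
    "ship": [0.9, 0.1],
    "sails": [0.8, 0.2],
    "sea": [0.85, 0.15],
    "sheep": [0.1, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def semantic_coherence(text, embeddings):
    """Mean pairwise cosine similarity over the words that have a vector."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two known words: no evidence either way
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def rescore(nbest, embeddings, w_ac=1.0, w_lm=1.0, w_sem=2.0):
    """Return the hypothesis with the highest combined score."""
    def total(h):
        return (w_ac * h.acoustic_logprob
                + w_lm * h.lm_logprob
                + w_sem * semantic_coherence(h.text, embeddings))
    return max(nbest, key=total)


if __name__ == "__main__":
    nbest = [
        Hypothesis("the sheep sails on the sea", -10.0, -12.0),
        Hypothesis("the ship sails on the sea", -10.5, -11.8),
    ]
    print(rescore(nbest, TOY_EMBEDDINGS, w_sem=0.0).text)  # no semantic term
    print(rescore(nbest, TOY_EMBEDDINGS, w_sem=2.0).text)  # with semantic term
```

In this toy list the acoustically preferred hypothesis contains a word that fits poorly with the rest of the sentence, so adding the semantic term changes which hypothesis is selected. In a trained system the weights and the semantic score would be learned rather than set by hand.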
