SD · AS · Nov 7, 2020

Dual Application of Speech Enhancement for Automatic Speech Recognition

arXiv:2011.03840v1 · 47 citations
AI Analysis

This work addresses noisy speech recognition in social media videos. Its contribution is an incremental improvement over existing methods.

The paper aims to improve automatic speech recognition (ASR) accuracy by applying speech enhancement in two ways: as a data augmentation technique and as a preprocessing frontend. On a social media English video dataset, it reports average relative improvements of 11.2% with enhancement-based data augmentation, 8.3% with enhancement-based preprocessing, and 13.4% when the two are combined.

In this work, we exploit speech enhancement for improving a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for complex spectral mapping based speech enhancement, and find it helpful for ASR in two ways: a data augmentation technique, and a preprocessing frontend. In using it for ASR data augmentation, we exploit a KL divergence based consistency loss that is computed between the ASR outputs of original and enhanced utterances. In using speech enhancement as an effective ASR frontend, we propose a three-step training scheme based on model pretraining and feature selection. We evaluate our proposed techniques on a challenging social media English video dataset, and achieve an average relative improvement of 11.2% with speech enhancement based data augmentation, 8.3% with enhancement based preprocessing, and 13.4% when combining both.
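
The consistency loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that a KL divergence is computed between the ASR outputs of the original and enhanced utterances, so everything else here is an assumption: the outputs are treated as a sequence of per-position logit vectors over a token vocabulary, the divergence direction is KL(original || enhanced), and the result is averaged over positions. The function names `softmax`, `kl_divergence`, and `consistency_loss` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert one vector of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_original, logits_enhanced):
    """Mean per-position KL between ASR outputs for the two inputs.

    Assumption: both arguments are lists of logit vectors with the same
    number of positions and the same vocabulary size. The direction
    KL(original || enhanced) is an illustrative choice, not the paper's.
    """
    if len(logits_original) != len(logits_enhanced):
        raise ValueError("both outputs must have the same number of positions")
    total = 0.0
    for lo, le in zip(logits_original, logits_enhanced):
        total += kl_divergence(softmax(lo), softmax(le))
    return total / len(logits_original)


if __name__ == "__main__":
    # Toy logits: 2 positions, vocabulary of 3 tokens (made-up values).
    original = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    enhanced = [[1.0, 1.5, -0.5], [0.3, 0.2, 0.1]]
    print("loss (different outputs):", consistency_loss(original, enhanced))
    print("loss (identical outputs):", consistency_loss(original, original))
```

In a training setup of this kind, such a term would typically be added to the main ASR loss with a weighting factor, so that the model is encouraged to produce similar predictions for an utterance and its enhanced version. The weighting and the exact form used in the paper are not given in the abstract.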

