SD AS · Nov 18, 2020

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

arXiv:2011.09081v1 · 9 citations
AI Analysis

This work improves speech recognition accuracy for smart speakers used in noisy, echo-prone environments, as evaluated on real-world XiaoMi smart speaker data.

This paper addresses the challenge of multi-channel automatic speech recognition (ASR) in noisy environments with reverberation and echoes. The authors propose using a deep complex Unet (DCUnet) as the front-end in a multi-task learning (MTL) framework, achieving a 12.2% relative character error rate (CER) reduction on 1000 hours of real-world XiaoMi smart speaker data compared to traditional array processing.
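To make the "relative CER reduction" figure concrete, here is a minimal sketch of how character error rate and a relative reduction are commonly computed. The function names are hypothetical, and the 10.0% and 8.78% values in the usage note are illustrative numbers chosen to yield 12.2%, not absolute CERs reported in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, new_cer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer
```

Under these illustrative assumptions, `relative_reduction(0.10, 0.0878)` is approximately 0.122: a system that lowers CER from 10.0% to 8.78% achieves a 12.2% relative reduction, even though the absolute drop is only 1.22 percentage points.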

The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
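The abstract describes DCUnet as complex-valued. As a rough illustration of what that means at the layer level, the sketch below builds a 1-D complex convolution out of four real convolutions, which is how complex-valued layers are commonly implemented. This is a toy, pure-Python example with hypothetical function names; the paper's actual layer shapes, strides, and channel handling are not given here and are not reproduced.

```python
def real_conv1d(x, w):
    """'Valid' 1-D cross-correlation of two real-valued sequences."""
    n = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n)]


def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex convolution expressed with real arithmetic only.

    (w_re + j*w_im) * (x_re + j*x_im)
        = (w_re*x_re - w_im*x_im) + j*(w_re*x_im + w_im*x_re)
    """
    rr = real_conv1d(x_re, w_re)
    ii = real_conv1d(x_im, w_im)
    ri = real_conv1d(x_im, w_re)
    ir = real_conv1d(x_re, w_im)
    out_re = [a - b for a, b in zip(rr, ii)]
    out_im = [a + b for a, b in zip(ri, ir)]
    return out_re, out_im
```

Keeping real and imaginary parts as separate real tensors lets such a layer operate directly on complex spectrogram inputs, so both magnitude and phase information pass through the network.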

