SD · AS · Nov 26, 2020

Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

arXiv:2011.13393v2 · 2 citations
AI Analysis

This work provides an incremental improvement for speech recognition systems operating in noisy, multi-speaker environments, benefiting users who need accurate transcription in challenging acoustic conditions.

This paper addresses target-speaker speech recognition in noisy environments by integrating time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer (RNN-T). Adding neural uncertainty estimation yields a 17% relative Character Error Rate (CER) reduction on multi-speaker signals with background noise, measured against a recognizer fine-tuned with the extraction model alone. In multi-condition experiments, the method achieves a 9% relative performance gain in the noisy condition while maintaining performance in the clean condition.

Target-speaker speech recognition aims to recognize the target speaker's speech in noisy environments that contain background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before training them jointly. In addition, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module yields a significant 17% relative reduction in Character Error Rate (CER) on multi-speaker signals with background noise. The multi-condition experiments indicate that our method achieves a 9% relative performance gain in the noisy condition while maintaining performance in the clean condition.
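The abstract reports its results as relative CER reductions. The sketch below shows how such a figure is commonly computed, assuming CER is character-level edit distance divided by reference length. The function names and the numbers in the usage example are hypothetical illustrations, not the authors' evaluation code or results from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical error rates chosen only to illustrate the arithmetic.
    baseline_cer, improved_cer = 0.20, 0.166
    print(f"relative reduction: {relative_reduction(baseline_cer, improved_cer):.0%}")
```

A relative reduction is scaled by the baseline, so the same absolute CER drop counts for more when the baseline error rate is already low. This is why the baseline system (here, the recognizer fine-tuned with the extraction model) matters when reading the 17% figure.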

