ASAI Mar 25, 2022

Spatial Processing Front-End For Distant ASR Exploiting Self-Attention Channel Combinator

arXiv:2203.13919v1
7 citations, h-index: 40
Originality: Incremental advance
AI Analysis

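This work targets ASR accuracy in reverberant, far-field conditions with multi-microphone input. It is an incremental advance: it combines established components (WPE dereverberation and MVDR beamforming) with a recently proposed self-attention channel combinator and reports measurable gains in word error rate and dereverberation.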

The authors address distant automatic speech recognition (ASR) with a multi-channel front-end that chains channel shortening (WPE), a fixed MVDR beamformer, and a self-attention-based channel combinator (SACC). Used with a ContextNet-based end-to-end ASR system, the front-end yields a 21.6% relative reduction in word error rate (WER) on a multi-channel LibriSpeech playback dataset, and the 8-channel WPE stage improves the estimated C50 of the signals by 13.6 dB.
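A relative reduction is measured against the baseline error rate, not in absolute percentage points. The short sketch below shows the arithmetic. The function name `relative_wer_reduction` and both WER values are hypothetical and chosen only for illustration; the paper's absolute WER figures are not given on this page.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (here: percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; NOT taken from the paper.
baseline = 10.0  # assumed baseline WER, in percent
proposed = 7.84  # assumed WER with the front-end, in percent

reduction = relative_wer_reduction(baseline, proposed)
print(f"{reduction:.1f}% relative reduction")  # prints "21.6% relative reduction"
```

With these assumed values the absolute gain is only 2.16 percentage points, which is why the relative and absolute figures should not be confused.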

We present a novel multi-channel front-end based on channel shortening with the Weighted Prediction Error (WPE) method followed by a fixed MVDR beamformer used in combination with a recently proposed self-attention-based channel combination (SACC) scheme, for tackling the distant ASR problem. We show that the proposed system used as part of a ContextNet-based end-to-end (E2E) ASR system outperforms leading ASR systems as demonstrated by a 21.6% relative reduction in WER on a multi-channel LibriSpeech playback dataset. We also show how dereverberation prior to beamforming is beneficial and compare the WPE method with a modified neural channel shortening approach. An analysis of the non-intrusive estimate of the signal C50 confirms that the 8-channel WPE method provides significant dereverberation of the signals (13.6 dB improvement). We also show how the weights of the SACC system allow the extraction of accurate spatial information which can be beneficial for other speech processing applications like diarization.
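For readers unfamiliar with C50: the clarity index is the ratio, in dB, of the energy in the first 50 ms of a room impulse response to the energy arriving later, so a higher C50 means less late reverberation. The sketch below computes it directly from a known impulse response. This is the intrusive textbook definition, not the non-intrusive estimator used in the paper, which works from the speech signal alone. The helper names `c50_db` and `synthetic_rir`, the exponentially decaying noise model, and the assumption that the impulse response starts at the direct path are all illustrative choices, not taken from the source.

```python
import numpy as np


def c50_db(rir: np.ndarray, sample_rate: int) -> float:
    """Clarity index C50 in dB for an impulse response.

    Assumes rir[0] is the direct-path arrival, so the early part is
    simply the first 50 ms of the array.
    """
    rir = np.asarray(rir, dtype=float)
    split = int(round(0.050 * sample_rate))  # number of samples in 50 ms
    early_energy = np.sum(rir[:split] ** 2)
    late_energy = np.sum(rir[split:] ** 2)
    return float(10.0 * np.log10(early_energy / late_energy))


def synthetic_rir(rt60: float, sample_rate: int = 16000,
                  seconds: float = 1.0, seed: int = 0) -> np.ndarray:
    """Toy impulse response: white noise whose amplitude decays by 60 dB over rt60 seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(seconds * sample_rate)) / sample_rate
    envelope = 10.0 ** (-3.0 * t / rt60)  # -60 dB in amplitude at t = rt60
    return rng.standard_normal(t.size) * envelope


# A drier room (shorter RT60) should give a higher C50 than a more reverberant one.
dry = c50_db(synthetic_rir(rt60=0.3), 16000)
wet = c50_db(synthetic_rir(rt60=0.8), 16000)
print(f"C50 (RT60 = 0.3 s): {dry:.1f} dB")
print(f"C50 (RT60 = 0.8 s): {wet:.1f} dB")
print(f"Difference: {dry - wet:.1f} dB")
```

Under this toy model, a gain of the size reported in the abstract corresponds to a large shift of energy from the late tail into the early part of the response, which is the effect dereverberation aims for.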
