AS, CL, LG, SD | Apr 24, 2022

Improved far-field speech recognition using Joint Variational Autoencoder

arXiv:2204.11286v1 | 1 citation | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the challenge of noise and reverberation in automatic speech recognition systems, particularly in matched training scenarios, offering incremental improvements for speech processing applications.

The paper tackles far-field speech recognition by proposing a joint Variational Autoencoder (VAE) that maps far-field speech features to close-talk features. It reports a 2.5% absolute improvement in word error rate (WER) over denoising autoencoder methods and 3.96% over baseline acoustic models trained directly on far-field features.

Automatic speech recognition (ASR) systems suffer considerably when the source speech is corrupted by noise or room impulse responses (RIRs). Typically, speech enhancement is applied in both mismatched and matched training and testing scenarios. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting the AM is fixed. Recently, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched-scenario training and show that the proposed joint VAE-based mapping achieves a significant improvement over the DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.
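
The reported gains are absolute differences in WER, which is easy to confuse with relative improvement. The following is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words) and of the difference between absolute and relative improvement. The function names and the example sentences are illustrative assumptions and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Reduction in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative sentences (not from the paper): one deletion ("on") and one
# substitution ("light" -> "lights") against five reference words.
example_wer = word_error_rate("turn on the kitchen light", "turn the kitchen lights")
```

Here `example_wer` is 2 errors over 5 reference words, or 0.4. As a purely hypothetical illustration of the distinction, going from a WER of 20.0% to 17.5% is a 2.5% absolute improvement but a 12.5% relative one; the paper's abstract states absolute figures only.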

