AS LG SD
Nov 13, 2019

3-D Feature and Acoustic Modeling for Far-Field Speech Recognition

Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy

arXiv:1911.05504v2

Originality: Highly original

AI Analysis

This work addresses the problem of robust speech recognition in noisy, reverberant environments for applications like voice assistants, offering a novel method that improves over traditional beamforming.

The paper tackles automatic speech recognition in multi-channel reverberant conditions. It proposes a 3-D feature and acoustic modeling approach that extracts features directly from the multi-channel signal using multivariate autoregressive (MAR) modeling and models them jointly with a convolutional neural network. Compared to beamforming-based systems, this yields average relative word error rate improvements of 10% on CHiME-3 and 9% on the REVERB Challenge dataset.
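To make the "relative improvement" figures concrete, the sketch below computes a relative word error rate (WER) reduction from a baseline WER and a proposed-system WER. The function name and the two WER values are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction, in percent, of a proposed system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper.
baseline = 20.0  # e.g. an ASR system trained on beamformed audio
proposed = 18.0  # e.g. an ASR system using multi-channel features
improvement = relative_wer_improvement(baseline, proposed)
print(f"{improvement:.1f}% relative improvement")
```

Note that a relative improvement is measured against the baseline WER, so a 10% relative improvement on a 20% baseline corresponds to an absolute reduction of 2 percentage points.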

Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing reverberation artifacts involves beamforming-based enhancement of the multi-channel speech signal, which is then used to extract spectrogram-based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multivariate autoregressive (MAR) modeling approach, where the correlations among all three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the multi-channel features to be combined so as to optimize the speech recognition cost, in contrast to traditional beamforming models, which focus on the enhancement task. Experiments are conducted on the CHiME-3 and REVERB Challenge datasets using multi-channel reverberant speech. In these experiments, the proposed 3-D feature and acoustic modeling approach provides significant improvements over an ASR system trained with beamformed audio (average relative improvements of 10% and 9% in word error rates for the CHiME-3 and REVERB Challenge datasets, respectively).
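The idea of joint modeling over channel, time and frequency can be illustrated with a single 3-D convolution followed by a ReLU. The sketch below is a minimal pure-Python illustration, not the authors' architecture: the function name, tensor sizes and kernel values are all hypothetical, and the input is a placeholder tensor rather than actual MAR features.

```python
def conv3d_relu(x, kernel):
    """Naive 'valid' 3-D cross-correlation followed by ReLU.

    x:      nested list of shape (C, T, F) = channel x time x frequency
    kernel: nested list of shape (kc, kt, kf)
    returns a nested list of shape (C-kc+1, T-kt+1, F-kf+1)
    """
    C, T, F = len(x), len(x[0]), len(x[0][0])
    kc, kt, kf = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for c in range(C - kc + 1):
        plane = []
        for t in range(T - kt + 1):
            row = []
            for f in range(F - kf + 1):
                acc = 0.0
                # Each output value mixes neighbouring channels, frames and bands.
                for dc in range(kc):
                    for dt in range(kt):
                        for df in range(kf):
                            acc += x[c + dc][t + dt][f + df] * kernel[dc][dt][df]
                row.append(max(acc, 0.0))  # ReLU non-linearity
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical sizes: 4 channels, 6 time frames, 5 frequency bands.
C, T, F = 4, 6, 5
features = [[[1.0] * F for _ in range(T)] for _ in range(C)]  # placeholder input
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]    # 2 x 3 x 3 kernel

out = conv3d_relu(features, kernel)
out_shape = (len(out), len(out[0]), len(out[0][0]))
print(out_shape)
```

Because the kernel spans more than one channel, the filter weights that combine the microphone signals are learned together with the rest of the acoustic model in a trained network, rather than being fixed by a separate enhancement stage.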
