AS SD

Apr 21, 2021

Scene-aware Far-field Automatic Speech Recognition

arXiv:2104.10757v1

2 citations

Originality: Incremental advance

AI Analysis

This work addresses far-field speech recognition for applications in noisy environments, but it is incremental as it builds on existing data augmentation methods.

The paper tackles the problem of far-field automatic speech recognition by generating scene-aware training data, resulting in a 2.64% absolute improvement in word error rate compared to using uniformly distributed data.

We propose a novel method for generating scene-aware training data for far-field automatic speech recognition. We use a deep learning-based estimator to non-intrusively compute the sub-band reverberation time of an environment from its speech samples. We model the acoustic characteristics of a scene with its reverberation time and represent it using a multivariate Gaussian distribution. We use this distribution to select acoustic impulse responses from a large real-world dataset for augmenting speech data. The speech recognition system trained on our scene-aware data consistently outperforms the system trained using many more random acoustic impulse responses on the REVERB and AMI far-field benchmarks. In practice, we obtain a 2.64% absolute improvement in word error rate compared with using training data of the same size with uniformly distributed reverberation times.
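The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the sub-band reverberation times have already been estimated (the deep learning-based estimator is not shown), the function names `fit_scene_gaussian` and `select_impulse_responses` are hypothetical, and the rule of drawing samples from the fitted Gaussian and taking the nearest unused impulse response is one plausible reading of "use this distribution to select".

```python
import numpy as np


def fit_scene_gaussian(scene_t60):
    """Fit a multivariate Gaussian to sub-band reverberation times.

    scene_t60 has shape (n_samples, n_bands): one row per speech sample,
    one column per frequency sub-band (values in seconds).
    """
    scene_t60 = np.asarray(scene_t60, dtype=float)
    mean = scene_t60.mean(axis=0)
    # rowvar=False: columns are the variables (sub-bands).
    cov = np.atleast_2d(np.cov(scene_t60, rowvar=False))
    return mean, cov


def select_impulse_responses(pool_t60, mean, cov, k, seed=0):
    """Return k distinct indices into the impulse response pool.

    Assumed selection rule (not confirmed by the source): draw k target
    vectors from the fitted Gaussian, then for each target take the
    nearest pool entry (Euclidean distance) that is not yet chosen.
    """
    pool_t60 = np.asarray(pool_t60, dtype=float)
    if k > len(pool_t60):
        raise ValueError("k exceeds the number of available impulse responses")
    rng = np.random.default_rng(seed)
    targets = rng.multivariate_normal(mean, cov, size=k)
    available = np.ones(len(pool_t60), dtype=bool)
    chosen = []
    for target in targets:
        dist = np.linalg.norm(pool_t60 - target, axis=1)
        dist[~available] = np.inf  # skip entries already selected
        idx = int(np.argmin(dist))
        available[idx] = False
        chosen.append(idx)
    return chosen


if __name__ == "__main__":
    # Synthetic stand-ins for estimator output and a real-world IR dataset.
    rng = np.random.default_rng(1)
    scene = rng.normal([0.6, 0.5, 0.4], 0.05, size=(200, 3))
    pool = rng.uniform(0.1, 1.5, size=(1000, 3))
    mean, cov = fit_scene_gaussian(scene)
    picked = select_impulse_responses(pool, mean, cov, k=50)
    print("scene mean T60:   ", mean)
    print("selected mean T60:", pool[picked].mean(axis=0))
```

The selected impulse responses would then be convolved with clean speech to build the augmented training set. Sampling without replacement keeps the selected set diverse, at the cost of drifting slightly from the target distribution when the pool is sparse near the scene's mean.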
