CLFeb 14, 2017

On the Relevance of Auditory-Based Gabor Features for Deep Learning in Automatic Speech Recognition

Angel Mario Castro Martinez, Sri Harish Mallidi, Bernd T. Meyer

arXiv:1702.04333v10.321 citations

Originality Incremental advance

AI Analysis

This work addresses robust speech recognition in noisy environments, but it is incremental as it builds on prior studies to validate a hypothesis about feature discriminability.

The paper tackled the problem of understanding why auditory-based Gabor features improve deep learning in automatic speech recognition, finding that high temporal modulation frequencies (16-25 Hz) consistently outperformed others with relative improvements of 11-56% against a baseline across three tasks.

Previous studies support the idea of merging auditory-based Gabor features with deep learning architectures to achieve robust automatic speech recognition, however, the cause behind the gain of such combination is still unknown. We believe these representations provide the deep learning decoder with more discriminable cues. Our aim with this paper is to validate this hypothesis by performing experiments with three different recognition tasks (Aurora 4, CHiME 2 and CHiME 3) and assess the discriminability of the information encoded by Gabor filterbank features. Additionally, to identify the contribution of low, medium and high temporal modulation frequencies subsets of the Gabor filterbank were used as features (dubbed LTM, MTM and HTM respectively). With temporal modulation frequencies between 16 and 25 Hz, HTM consistently outperformed the remaining ones in every condition, highlighting the robustness of these representations against channel distortions, low signal-to-noise ratios and acoustically challenging real-life scenarios with relative improvements from 11 to 56% against a Mel-filterbank-DNN baseline. To explain the results, a measure of similarity between phoneme classes from DNN activations is proposed and linked to their acoustic properties. We find this measure to be consistent with the observed error rates and highlight specific differences on phoneme level to pinpoint the benefit of the proposed features.

View on arXiv PDF

Similar