CLNov 3, 2017

Learning Filterbanks from Raw Speech for Phone Recognition

arXiv:1711.01161v2130 citations
Originality Incremental advance
AI Analysis

This work addresses speech recognition challenges by introducing a learnable front-end, though it is incremental as it builds on existing filterbank methods.

The paper tackled the problem of phone recognition by learning time-domain filterbanks from raw speech, which consistently outperformed traditional mel-filterbanks on TIMIT, achieving improved performance across several architectures.

We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes