AS, SD | Mar 25, 2021

Radically Old Way of Computing Spectra: Applications in End-to-End ASR

arXiv:2103.14129v2 | 7 citations
AI Analysis

This work addresses robustness issues in automatic speech recognition for realistic, noisy environments, representing an incremental improvement over existing feature extraction methods.

The paper aims to improve speech recognition robustness by computing spectrograms with Frequency Domain Linear Prediction (FDLP), which captures low-frequency temporal modulations using a 1.5-second context window. The results show that FDLP spectrograms perform on par with standard mel spectrograms for clean speech, and achieve up to 25% and 22% relative WER improvements under train-test domain mismatch and reverberation, respectively.
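To make "relative WER improvement" concrete: it is the WER reduction divided by the baseline WER, not the difference in percentage points. The short sketch below shows the arithmetic. The function name and the input numbers are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(wer_baseline, wer_new):
    """Fraction by which wer_new improves on wer_baseline (both in the same units)."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_baseline - wer_new) / wer_baseline


# Hypothetical example values, chosen only to illustrate the formula:
# a drop from 20.0% to 15.0% WER is 5 points absolute, 25% relative.
improvement = relative_wer_improvement(20.0, 15.0)
print(f"{improvement:.0%}")
```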

We propose a technique to compute spectrograms using Frequency Domain Linear Prediction (FDLP) that uses all-pole models to fit the squared Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. A long context window of 1.5 seconds allows us to capture the low frequency temporal modulations of speech in the spectrogram. For an end-to-end automatic speech recognition task, the FDLP spectrogram performs on par with the standard mel spectrogram features for clean read speech training and test data. For more realistic speech data with train-test domain mismatches or reverberations, FDLP spectrogram shows up to 25% and 22% relative WER improvements over mel spectrogram respectively.
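The steps in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It follows the usual formulation of FDLP, in which linear prediction is applied to a cosine transform of the signal so that the all-pole response traces an envelope over time rather than over frequency. The uniform rectangular sub-bands, the Hann taper with a half-window hop, the model order, the small regularisation term, and the names dct2, allpole_envelope, and fdlp_spectrogram are all choices made here for illustration.

```python
import numpy as np


def dct2(x):
    """Unnormalised DCT-II, computed from an FFT of the even-symmetric extension."""
    n = len(x)
    y = np.fft.fft(np.concatenate([x, x[::-1]]))[:n]
    return 0.5 * np.real(y * np.exp(-1j * np.pi * np.arange(n) / (2 * n)))


def allpole_envelope(coeffs, order, n_points):
    """Fit an all-pole model to a run of DCT coefficients (autocorrelation
    method) and evaluate its response at n_points positions across the
    analysis window. The result approximates the squared Hilbert envelope
    of the sub-band those coefficients cover."""
    m = len(coeffs)
    # Autocorrelation of the coefficient sequence at lags 0..order.
    r = np.array([coeffs[: m - k] @ coeffs[k:] for k in range(order + 1)])
    r[0] = r[0] * (1 + 1e-6) + 1e-12  # slight regularisation (assumption)
    # Yule-Walker equations with a Toeplitz autocorrelation matrix.
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], -r[1:])
    gain = r[0] + a @ r[1:]  # prediction-error power
    # Response of 1/A(z) on [0, pi); each point maps to one time sample.
    A = np.fft.rfft(np.concatenate([[1.0], a]), 2 * n_points)[:n_points]
    return gain / np.abs(A) ** 2


def fdlp_spectrogram(x, fs, n_bands=8, win_s=1.5, order=20):
    """Overlap-add per-band all-pole envelopes from long analysis windows.
    Returns an array of shape (n_bands, len(x)); assumes len(x) >= window."""
    win = int(round(win_s * fs))
    hop = win // 2                   # half-window hop (assumption)
    taper = np.hanning(win)          # Hann taper for overlap-add (assumption)
    edges = np.linspace(0, win, n_bands + 1).astype(int)  # uniform bands
    spec = np.zeros((n_bands, len(x)))
    for start in range(0, len(x) - win + 1, hop):
        X = dct2(x[start:start + win])
        for b in range(n_bands):
            env = allpole_envelope(X[edges[b]:edges[b + 1]], order, win)
            spec[b, start:start + win] += taper * env
    return spec


# Usage on synthetic data: noise with an energy burst around sample 2000.
rng = np.random.default_rng(0)
fs = 2000
t = np.arange(3 * fs)
x = (0.1 + np.exp(-0.5 * ((t - 2000) / 100.0) ** 2)) * rng.standard_normal(t.size)
spec = fdlp_spectrogram(x, fs, n_bands=4)
print(spec.shape)
```

In a front end for recognition, the per-sample envelopes would typically be integrated down to a frame rate and log-compressed before being fed to the model. Those steps are omitted here because the abstract does not specify them.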

Code Implementations: 2 repos
