AS CL LG SD

Jun 12, 2023

Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

Belen Alastruey, Lukas Drude, Jahn Heymann, Simon Wiesler

arXiv:2306.06954v11.21 citations

h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses speech recognition accuracy for applications like Alexa. The advance is incremental, since it modifies an existing frontend component.

The paper improves automatic speech recognition by replacing convolutional frontends with a frequency-attention module, achieving a 2.4% relative word error rate reduction on production-scale data and 4.6% on public LibriSpeech data.
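The relative word error rate reduction quoted above is the drop in WER expressed as a percentage of the baseline WER. A minimal sketch of that arithmetic follows. The function name and the two WER values are hypothetical, chosen only so the result lands on 2.4%; they are not figures reported in the paper.

```python
def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are absolute word error rates in the same unit
    (for example, percent). A positive result means wer_new is better.
    """
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


# Hypothetical WER values for illustration only (not from the paper).
baseline_wer = 10.0  # assumed WER of the system with a CNN frontend
new_wer = 9.76       # assumed WER of the system with the replacement frontend

rwerr = relative_wer_reduction(baseline_wer, new_wer)
print(f"rWERR: {rwerr:.1f}%")
```

Because the measure is relative, the same 2.4% can correspond to very different absolute gains depending on how low the baseline WER already is.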

Convolutional frontends are a typical choice for Transformer-based automatic speech recognition: they preprocess the spectrogram, reduce its sequence length, and combine local information in time and frequency in the same way. However, the width and height of an audio spectrogram carry different information. Because of reverberation and the articulatory system, the time axis has a clear left-to-right dependency. In contrast, vowels and consonants show very different patterns and occupy almost disjoint frequency ranges. We therefore hypothesize that global attention over frequencies is more beneficial than local convolution. Replacing the convolutional neural network frontend of a production-scale Conformer transducer with the proposed F-Attention module yields a 2.4% relative word error rate reduction (rWERR) on Alexa traffic. To demonstrate generalizability, we validate this on public LibriSpeech data with a long short-term memory-based Listen, Attend and Spell architecture, obtaining 4.6% rWERR, and demonstrate robustness to (simulated) noisy conditions.
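To make the idea of global attention over frequencies concrete, here is a minimal NumPy sketch in which every frequency bin of a frame attends to every other bin of that same frame. This is not the paper's F-Attention module, whose details the abstract does not give. The single-head design, the scalar-to-vector embedding, the per-frequency position embedding, and all names and shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frequency_attention(spec, w_in, pos, w_q, w_k, w_v):
    """Single-head self-attention across frequency bins, per time frame.

    spec: (T, F) spectrogram, T frames by F frequency bins.
    w_in: (d,) assumed embedding that lifts each scalar bin to d dims.
    pos:  (F, d) assumed embedding that tells the model which bin is which.
    w_q, w_k, w_v: (d, d) query, key, and value projections.
    Returns the attended features (T, F, d) and the weights (T, F, F).
    """
    # Each frequency bin becomes one token of dimension d.
    tokens = spec[:, :, None] * w_in[None, None, :] + pos[None, :, :]
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    d = q.shape[-1]
    # Scores compare every bin with every other bin in the same frame,
    # so the receptive field over frequency is global, unlike a local kernel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Random inputs stand in for a real spectrogram and trained parameters.
rng = np.random.default_rng(0)
T, F, d = 5, 8, 4
spec = rng.standard_normal((T, F))
w_in = rng.standard_normal(d)
pos = rng.standard_normal((F, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

out, weights = frequency_attention(spec, w_in, pos, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Attention is applied independently within each frame here, so the time axis is left untouched; a real frontend would still need some mechanism to reduce sequence length along time, which this sketch omits.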
