Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition
This work addresses a specific bottleneck in acoustic modeling for real-time automatic speech recognition: frequency-LSTM frontends are tied to a single, fixed window size and stride. Removing that constraint yields an incremental improvement for applications such as far-field speech processing.
The paper tackles a limitation of frequency-LSTM (FLSTM) architectures in speech recognition, which operate at a fixed window size and stride. It proposes a multi-view FLSTM that combines the outputs of FLSTM stacks with different views to model a wider range of time-frequency correlations. The result is a relative Word Error Rate improvement of 3-7% over a single FLSTM model at a similar computational cost.
Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined and tuned window size and stride, referred to as a 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture can model a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios over an optimized single FLSTM model, while retaining a similar computational footprint.
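To make the notion of a 'view' concrete, the following is a minimal sketch of how one acoustic frame could be split into overlapping chunks under several (window size, stride) pairs. It assumes the window and stride apply along the frequency axis of a single frame. The function names, the 64-bin frame, and the two views are hypothetical choices for illustration, not values taken from the paper. The FLSTM stacks themselves and the dimensionality-reducing projection are omitted.

```python
from typing import Dict, List, Sequence, Tuple

# A view is a (window_size, stride) pair over frequency bins.
View = Tuple[int, int]


def frequency_chunks(frame: Sequence[float], window: int, stride: int) -> List[List[float]]:
    """Split one frame into overlapping chunks along the frequency axis."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > len(frame):
        raise ValueError("window exceeds the number of frequency bins")
    # Trailing bins that do not fill a whole window are dropped in this sketch.
    return [list(frame[start:start + window])
            for start in range(0, len(frame) - window + 1, stride)]


def multi_view_chunks(frame: Sequence[float], views: Sequence[View]) -> Dict[View, List[List[float]]]:
    """Chunk the same frame once per view; each result would feed its own FLSTM stack."""
    return {view: frequency_chunks(frame, *view) for view in views}


# Illustrative input: one frame with 64 frequency bins (assumed size).
frame = [float(i) for i in range(64)]
# Two hypothetical views: a narrow one and a wide one.
views = [(8, 4), (16, 8)]

chunks = multi_view_chunks(frame, views)
# Number of steps each per-view frequency LSTM would unroll over.
steps_per_view = {view: len(c) for view, c in chunks.items()}
```

Each view produces a different number of chunks of a different width, so each hypothetical FLSTM stack sees the same frame at a different frequency resolution. In the architecture described above, the per-view outputs would then be combined and projected to a lower dimension before the time LSTM layers.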