AS CL SD · Oct 29, 2018

Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model For Speech Recognition

arXiv:1810.12001v22.31 citations

Originality: Incremental advance

AI Analysis

This work addresses the efficiency and accuracy of automatic speech recognition (ASR), presenting incremental improvements through architectural modifications and training optimizations.

The authors propose a cascaded CNN-resBiLSTM-CTC model that adds residual blocks to the BiLSTM layers and uses a cascaded structure to focus on hard negative samples. The model achieves a 3.41% word error rate on the LibriSpeech test-clean set, and a batch-varied training method reduces training time by 25%.
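For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the hypothesis and the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluation scripts usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 3.41% corresponds to roughly 34 word errors per 1,000 reference words.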

Automatic speech recognition (ASR) tasks can be solved by end-to-end deep learning models, which require less preparation of raw data and are easier to transfer between languages. We propose a novel end-to-end deep learning architecture, the cascaded CNN-resBiLSTM-CTC. In the proposed model, we add residual blocks to the BiLSTM layers to extract sophisticated phoneme and semantic information together, and apply a cascaded structure to pay more attention to mining information from hard negative samples. By applying both a simple Fast Fourier Transform (FFT) technique and an n-gram language model (LM) rescoring method, we achieve a word error rate (WER) of 3.41% on the LibriSpeech test-clean corpus. Furthermore, we propose a new batch-varied method to speed up training in length-varied tasks, which results in 25% less training time.
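The abstract does not spell out how the batch-varied method works. As a rough illustration of the general idea of varying batch size with utterance length, the sketch below uses a common length-bucketing scheme: sort utterances by length and fill each batch up to a fixed padded-frame budget, so short utterances form large batches and long ones form small batches. The function name `length_budget_batches` and the `max_frames` parameter are hypothetical, and this is a swapped-in generic technique, not necessarily the authors' algorithm.

```python
from typing import List, Sequence


def length_budget_batches(lengths: Sequence[int], max_frames: int) -> List[List[int]]:
    """Group utterance indices into batches whose padded size stays within a budget.

    Assumption (not from the paper): the cost of a batch is
    (longest utterance in the batch) * (number of utterances), i.e. its
    size after padding every utterance to the longest one.
    """
    # Visit utterances from shortest to longest so that each batch holds
    # similarly sized items and padding is kept small.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])

    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Ascending order means the newest item is the longest in the batch.
        padded_cost = lengths[idx] * (len(current) + 1)
        if current and padded_cost > max_frames:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    frame_counts = [100, 900, 120, 450, 110, 500]
    for batch in length_budget_batches(frame_counts, max_frames=1000):
        print(batch, [frame_counts[i] for i in batch])
```

With a fixed batch size, every batch would be padded to its longest utterance; a length-dependent batch size avoids spending computation on padding, which is one plausible source of the kind of training-time saving the abstract reports.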
