Deep CLAS: Deep Contextual Listen, Attend and Spell
This work improves ASR accuracy for rare words, which is important for applications like named entity recognition, but it is incremental as it builds on existing CLAS methods.
The paper tackles the insufficient use of contextual information in Contextual-LAS (CLAS) for ASR, particularly for rare words, by proposing deep CLAS with a bias loss and character-level encoding, yielding a 65.78% relative increase in recall and a 53.49% relative increase in F1-score on AISHELL-1 over CLAS baselines.
Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without an explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a conformer rather than an LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments are conducted on the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative increase in recall and a 53.49% relative increase in F1-score in the named entity recognition scenario.
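
To make the last two ideas concrete, the following is a minimal sketch of how a bias attention score over context characters could be supervised and then used to correct an output distribution. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the scaled dot-product scoring, the extra "no-bias" slot, the mixing rule in `correct_distribution`, the cross-entropy form of `bias_loss`, and all function and variable names are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bias_attention(query, context_keys, no_bias_key):
    """Attention weights over [no-bias slot] + context characters.

    Assumption: scaled dot-product scoring; index 0 is the no-bias slot.
    """
    keys = [no_bias_key] + context_keys
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def bias_loss(attn, target_index):
    """Assumed bias loss: cross-entropy against the slot that should be attended."""
    return -math.log(attn[target_index])


def correct_distribution(probs, attn, context_token_ids):
    """Assumed correction rule: keep the model's distribution in proportion to
    the no-bias weight and add each context character's attention weight to
    its vocabulary entry. The result still sums to 1."""
    corrected = [attn[0] * p for p in probs]
    for weight, token_id in zip(attn[1:], context_token_ids):
        corrected[token_id] += weight
    return corrected


# Toy example: a 3-token vocabulary and two context characters.
probs = [0.7, 0.2, 0.1]                    # model output distribution
query = [1.0, 0.0]                         # decoder-side query (hypothetical)
no_bias_key = [0.0, 0.0]
context_keys = [[2.0, 0.0], [-2.0, 0.0]]   # character-level context encodings
context_token_ids = [2, 1]                 # vocabulary ids of those characters

attn = bias_attention(query, context_keys, no_bias_key)
corrected = correct_distribution(probs, attn, context_token_ids)
loss = bias_loss(attn, target_index=1)     # slot 1 is the first context character
```

In this toy setup the query aligns with the first context character, so most of the attention mass moves to vocabulary id 2 and its corrected probability rises above the model's original value, while the corrected vector remains a valid probability distribution.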