AS CL · Jun 23, 2023

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu

arXiv:2306.13307v28.010 citations · h-index: 37

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of improving streaming speech recognition accuracy by making efficient use of historical context. The advance is incremental, since it builds on existing Conformer-Transducer methods.

The paper tackles the problem of incorporating long-range cross-utterance context into ASR systems by learning compact, low-dimensional contextual features in a Conformer-Transducer encoder. On the 1000-hr Gigaspeech corpus, this yields statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data.
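
As a rough aid to reading these figures, the sketch below relates absolute and relative WER reductions. It assumes the numbers pair up in the order given (0.7% with 4.3%, 0.5% with 3.1%) and that relative reduction is the absolute reduction divided by the baseline WER. The function name `implied_baseline` is illustrative, and because the reported figures are rounded, the baselines it recovers are only approximate and are not values stated in the paper.

```python
def implied_baseline(absolute_reduction, relative_reduction_pct):
    """Baseline WER (in %) implied by an absolute and a relative reduction.

    relative (%) = 100 * absolute / baseline, so
    baseline = 100 * absolute / relative (%).
    """
    return 100.0 * absolute_reduction / relative_reduction_pct


# Assumed pairing of the reported (absolute, relative) figures, both in %.
reported = [(0.7, 4.3), (0.5, 3.1)]

# Approximate baseline WERs consistent with the rounded reported figures.
baselines = [round(implied_baseline(a, r), 1) for a, r in reported]
print(baselines)
```

Under these assumptions the implied baseline WER is roughly 16% in both cases, which shows how sub-1% absolute gains translate into relative reductions of a few percent.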

Current ASR systems are mainly trained and evaluated at the utterance level, although long-range cross-utterance context can also be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. Previous research has been based either on LSTM-RNN encoded histories, which attenuate the information from longer-range contexts, or on frame-level concatenation of transformer context embeddings. In contrast, in this paper compact low-dimensional cross-utterance contextual features are learned in the Conformer-Transducer encoder using specially designed attention pooling layers that are applied over efficiently cached history vectors of the preceding utterances. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance-internal context only, with statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data.
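
The general idea of attention pooling over cached history vectors can be illustrated with a minimal sketch. This is not the authors' layer design: it assumes plain scaled dot-product attention with a small set of query vectors, and the names `attention_pool`, `history_cache`, and `pool_queries` are hypothetical. In a trained model the queries would be learned parameters; here they are fixed values for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(history, queries):
    """Compress `history` (T vectors of size d) into len(queries) vectors.

    Each query scores every cached history vector with a scaled dot
    product; the softmax-weighted average of the history vectors becomes
    one pooled context vector of size d.
    """
    dim = len(history[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * h for q, h in zip(query, vec)) / math.sqrt(dim)
            for vec in history
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * vec[i] for w, vec in zip(weights, history))
             for i in range(dim)]
        )
    return pooled


# Hypothetical cache: 6 history vectors of size 4 from preceding utterances.
history_cache = [[0.1 * (t + i) for i in range(4)] for t in range(6)]
# Hypothetical pooling queries (learned parameters in a real model).
pool_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

context = attention_pool(history_cache, pool_queries)
print(len(context), len(context[0]))  # 2 pooled vectors, each of size 4
```

However many history vectors are cached, the output size is fixed by the number of queries, which is what makes such a representation compact compared with concatenating frame-level embeddings.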
