CL SD AS | Sep 13, 2024

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu

arXiv:2409.08797v23.44 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work aims to improve ASR accuracy by adding SSL discrete speech features as cross-utterance context, an incremental advance in contextual modeling.

The paper improves contextual automatic speech recognition (ASR) by using self-supervised learning (SSL) discrete speech features as cross-utterance acoustic context in Zipformer-Transducer systems. It reports statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) over a baseline that uses utterance-internal context only, and what the authors describe as the lowest published WERs of 11.15% and 11.14% on the Gigaspeech dev and test sets.

Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or the current utterance's internal contexts alone, or both at the same time, is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance-internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context_ASR.
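The abstract quotes WER reductions in both absolute and relative terms. As a rough illustration of how the two relate, the sketch below computes a relative reduction from a baseline and a new WER, and back-computes the baseline implied by an absolute/relative pair. The function names are hypothetical, and pairing 0.32% with 2.78% and 0.41% with 3.54% is an assumption; because the quoted figures are rounded, the implied baselines are approximate and are not values stated in the abstract.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: absolute gain divided by the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER implied by an absolute reduction and its relative counterpart.

    Follows from relative = 100 * absolute / baseline. Inputs and output are in
    WER percentage points; the result is only as precise as the rounded inputs.
    """
    return 100.0 * absolute_reduction / relative_reduction_pct


if __name__ == "__main__":
    # Assumed pairing of the figures quoted in the abstract (hypothetical).
    for abs_red, rel_red in [(0.32, 2.78), (0.41, 3.54)]:
        base = implied_baseline(abs_red, rel_red)
        print(f"absolute {abs_red:.2f}, relative {rel_red:.2f}% "
              f"=> implied baseline WER of about {base:.2f}%")
```

Both reductions are measured against the same baseline, so the relative figure is larger in magnitude simply because the baseline WER is well below 100%.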
