Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR
This work aims to improve automatic speech recognition (ASR) accuracy by strengthening contextual modeling with self-supervised learning (SSL) discrete speech features.
The paper improves contextual automatic speech recognition (ASR) by using self-supervised learning (SSL) discrete speech features as cross-utterance acoustic context in Zipformer-Transducer systems. It achieves statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) and sets new lowest published WERs of 11.15% and 11.14% on the dev and test sets, respectively.
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or the current utterance's internal contexts alone, or both at the same time, is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets, respectively. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context_ASR.
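To make the notion of cross-utterance context concrete, the following is a minimal sketch of pairing each utterance with the discrete tokens of its preceding and future segments. It is not the paper's implementation: the function name `build_context_windows`, the one-utterance context span, and the toy token IDs are all assumptions made for illustration, and the sketch only assembles the inputs without modelling how a Zipformer encoder would consume them.

```python
from typing import List, Tuple

# One utterance is represented here as a list of discrete token IDs
# (for example, cluster indices derived from SSL model outputs).
Tokens = List[int]


def build_context_windows(
    utterances: List[Tokens],
    use_preceding: bool = True,
    use_future: bool = True,
) -> List[Tuple[Tokens, Tokens, Tokens]]:
    """Return (preceding, current, future) token triples, one per utterance.

    Hypothetical helper: context is limited to the immediately adjacent
    utterance on each side, and missing neighbours yield an empty list.
    """
    windows = []
    for i, current in enumerate(utterances):
        has_prev = use_preceding and i > 0
        has_next = use_future and i + 1 < len(utterances)
        preceding = utterances[i - 1] if has_prev else []
        future = utterances[i + 1] if has_next else []
        windows.append((preceding, current, future))
    return windows


# Toy session of three consecutive utterances; the token IDs are made up.
session = [[12, 7, 7, 301], [45, 45, 9], [88, 2, 150, 150]]
windows = build_context_windows(session)
```

Setting `use_preceding=False` and `use_future=False` reduces every triple to the current utterance alone, which corresponds to the utterance-internal-context condition described in the abstract.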