CL AS | Aug 31, 2023

Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

Kyuhong Shim, Jinkyu Lee, Simyung Chang, Kyuwoong Hwang

arXiv:2308.16415v1 | 0.53 citations | h-index: 29

Originality: Incremental advance

AI Analysis

This work aims to improve streaming ASR accuracy for real-time applications. The advance is incremental because it builds on existing knowledge distillation techniques.

The paper tackles the performance gap between streaming and non-streaming automatic speech recognition models. It proposes layer-to-layer knowledge distillation through auxiliary non-streaming layers, together with a loss based on autoregressive predictive coding. The authors report a significantly lower word error rate than previous token probability distillation methods.

Streaming automatic speech recognition (ASR) models are restricted from accessing future context, which results in worse performance compared to the non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from the non-streaming to streaming model has been studied, mainly focusing on aligning the output token probabilities. In this paper, we propose a layer-to-layer KD from the teacher encoder to the student encoder. To ensure that features are extracted using the same context, we insert auxiliary non-streaming branches to the student and perform KD from the non-streaming teacher layer to the non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts. Experimental results show that the proposed method can significantly reduce the word error rate compared to previous token probability distillation methods.
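The two distillation signals described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes features are plain lists of per-frame vectors, that mean squared error is the distance (the abstract does not name one), and that the APC-style term compares the streaming feature at frame t with the teacher feature at frame t + shift. The names `layer_kd_loss`, `apc_style_kd_loss`, and `shift` are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # T frames, each a D-dimensional feature vector


def mse(a: Frames, b: Frames) -> float:
    """Mean squared error between two equally shaped frame sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("feature dimensions must match")
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def layer_kd_loss(teacher: Frames, student_aux: Frames) -> float:
    """Frame-aligned matching of a teacher layer and an auxiliary
    non-streaming student layer (both see the full context).
    Assumption: MSE is used as the distance."""
    return mse(teacher, student_aux)


def apc_style_kd_loss(teacher: Frames, student_stream: Frames, shift: int) -> float:
    """APC-style term (assumed form): the streaming feature at frame t is
    compared with the teacher feature at frame t + shift, so the streaming
    layer is pushed to predict context it has not seen yet."""
    if len(teacher) != len(student_stream):
        raise ValueError("sequences must be equally long")
    if not 0 < shift < len(teacher):
        raise ValueError("shift must be between 1 and T - 1")
    return mse(teacher[shift:], student_stream[:-shift])


# Toy 1-D features: the streaming student already "predicts" one frame ahead.
teacher = [[0.0], [1.0], [2.0], [3.0]]
student_stream = [[1.0], [2.0], [3.0], [9.0]]
aligned = layer_kd_loss(teacher, teacher)
shifted = apc_style_kd_loss(teacher, student_stream, shift=1)
```

In this toy case the shifted loss is zero because each streaming frame equals the next teacher frame, while a frame-aligned comparison of the same two sequences would be penalized. In a real system the two terms would be computed on encoder activations and combined with the ASR training loss; how they are weighted is not stated in the abstract.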
