Sequence-Level Knowledge Distillation for Model Compression of Attention-based Sequence-to-Sequence Speech Recognition
This work addresses the problem of reducing model size for speech recognition systems; the contribution is incremental, as it applies an existing distillation technique to a specific domain.
The paper tackles model compression for attention-based sequence-to-sequence speech recognition by using sequence-level knowledge distillation from a teacher model to a student model, achieving up to 9.8x parameter reduction with a word-error rate increase of up to 7.0% on the WSJ corpus.
We investigate the feasibility of sequence-level knowledge distillation of sequence-to-sequence (Seq2Seq) models for large vocabulary continuous speech recognition (LVCSR). We first use a pre-trained, larger teacher model to generate multiple hypotheses per utterance with beam search. Given the same input, we then train the student model using these teacher-generated hypotheses as pseudo labels in place of the original ground-truth labels. We evaluate the proposed method on the Wall Street Journal (WSJ) corpus. It achieves up to $9.8\times$ parameter reduction at the cost of up to a 7.0\% increase in word-error rate (WER).
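The data-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `StepFn` interface, the function names `beam_search` and `build_distillation_set`, and the `toy_teacher` stand-in are all hypothetical, and a real teacher would be an attention-based Seq2Seq network rather than a fixed probability table. The sketch only shows how N-best teacher hypotheses could be turned into (input, pseudo-label) pairs for the student.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "</s>"  # end-of-sequence token (assumed symbol)

# Hypothetical teacher interface: given input features and the decoded prefix,
# return a mapping from next token to its log-probability.
StepFn = Callable[[Sequence[float], Tuple[str, ...]], Dict[str, float]]
Hypothesis = Tuple[Tuple[str, ...], float]


def beam_search(step_fn: StepFn, features: Sequence[float],
                beam_size: int, max_len: int = 20) -> List[Hypothesis]:
    """Return up to `beam_size` finished hypotheses, best score first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(features, prefix).items():
                hyp = (prefix + (token,), score + logp)
                if token == EOS:
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        if not candidates:
            break
        # Keep only the highest-scoring partial hypotheses.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda h: h[1], reverse=True)
    return finished[:beam_size]


def build_distillation_set(step_fn: StepFn,
                           utterances: Sequence[Sequence[float]],
                           n_best: int):
    """Pair each utterance with each teacher hypothesis as a pseudo label."""
    pairs = []
    for features in utterances:
        for tokens, _score in beam_search(step_fn, features, beam_size=n_best):
            labels = [t for t in tokens if t != EOS]
            pairs.append((features, labels))
    return pairs


def toy_teacher(features: Sequence[float], prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy stand-in for a trained teacher: emits two tokens, then EOS."""
    if len(prefix) >= 2:
        return {EOS: 0.0}
    if len(prefix) == 0:
        return {"a": math.log(0.7), "b": math.log(0.3)}
    return {"a": math.log(0.6), "b": math.log(0.4)}


if __name__ == "__main__":
    for features, labels in build_distillation_set(toy_teacher, [[0.1, 0.2]], n_best=2):
        print(features, labels)
```

The student would then be trained on these pairs with its usual sequence loss, treating each teacher hypothesis as if it were a ground-truth transcript. Whether hypotheses are weighted by their beam scores is a design choice this sketch leaves out.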