LG, CL, SD, AS | Oct 20, 2021

Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach

arXiv:2110.10429v1 | 4 citations
Originality: Incremental advance
AI Analysis

This work improves speech recognition through cross-modal knowledge distillation from a language model to an acoustic model. It is an incremental improvement over existing distillation methods.

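The paper aims to improve speech recognition by transferring knowledge from a pre-trained language model to an acoustic model. It proposes an acoustic model with multiple auxiliary output layers that compensates for the shortcomings of the existing label-interpolation-based distillation method, extends this to a hierarchical multi-task approach using LMs trained on senones, monophones, and subwords, and evaluates the hierarchical variant through an ablation study.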

The remarkable performance of pre-trained language models (LMs) using self-supervised learning has led to a major paradigm shift in the study of natural language processing. In line with these changes, improving the performance of speech recognition systems with massive deep learning-based LMs is a major topic of speech recognition research. Among the various methods of applying LMs to speech recognition systems, in this paper, we focus on a cross-modal knowledge distillation method that transfers knowledge between two types of deep neural networks with different modalities. We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation and demonstrate that the proposed method effectively compensates for the shortcomings of the existing label-interpolation-based distillation method. In addition, we extend the proposed method to a hierarchical distillation method using LMs trained on different units (senones, monophones, and subwords) and reveal the effectiveness of the hierarchical distillation method through an ablation study.
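
The contrast the abstract draws between label-interpolation-based distillation and distillation through auxiliary output layers can be illustrated with a small sketch. This is not the paper's implementation: the function names, the interpolation weight lam, the auxiliary weight, and the toy three-class setup are all assumptions made for illustration, and the abstract does not specify the exact loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def interpolated_label_loss(hard, lm_soft, am_logits, lam=0.5):
    """Label interpolation (assumed form): a single output layer is trained
    on a mix of the one-hot label and the LM's soft label."""
    mixed = [(1 - lam) * h + lam * s for h, s in zip(hard, lm_soft)]
    return cross_entropy(mixed, softmax(am_logits))

def auxiliary_head_loss(hard, lm_soft, main_logits, aux_logits, weight=0.5):
    """Auxiliary-layer variant (assumed form): the main output layer sees only
    the hard label, while a separate auxiliary layer matches the LM's soft label."""
    main = cross_entropy(hard, softmax(main_logits))
    aux = cross_entropy(lm_soft, softmax(aux_logits))
    return main + weight * aux

# Toy example with three output classes (hypothetical values).
hard = [0.0, 1.0, 0.0]        # one-hot transcription label
lm_soft = [0.2, 0.5, 0.3]     # soft label from the teacher LM
print(interpolated_label_loss(hard, lm_soft, [0.1, 2.0, 0.3]))
print(auxiliary_head_loss(hard, lm_soft, [0.1, 2.0, 0.3], [0.4, 1.0, 0.8]))
```

Under these assumptions, a hierarchical variant would add one auxiliary term per unit type (senones, monophones, subwords), each with its own output layer and weight.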

