CL, SD, AS | Jun 16, 2021

Collaborative Training of Acoustic Encoders for Speech Recognition

arXiv:2106.08960v2 | 12 citations
AI Analysis

This work addresses the need for efficient and accurate on-device speech recognition models across different computational budgets, though it is incremental as it builds on existing transducer and distillation techniques.

The paper tackles the problem of training multiple speech recognition models of varying sizes for on-device deployment by proposing a collaborative training method that uses shared modules and co-distillation, resulting in up to an 11% relative improvement in word error rate on LibriSpeech test partitions.
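For readers unfamiliar with the metric, the relative improvement quoted above is the reduction in word error rate divided by the baseline word error rate. The following is a minimal sketch of that arithmetic; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER improvement as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate an 11% relative improvement:
# a baseline WER of 10.0% reduced to 8.9%.
improvement = relative_wer_improvement(10.0, 8.9)
print(f"{improvement:.1%}")
```

Note that a relative improvement differs from an absolute one: in the hypothetical case above the absolute reduction is 1.1 WER points.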

On-device speech recognition requires training models of different sizes for deployment on devices with various computational budgets. When building such models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where the different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in the word error rate on both test partitions.
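The co-distillation idea in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: it assumes that co-distillation is realized as a symmetric KL divergence between the frame-level chenone posteriors of two encoders, and that the result is added to the transducer losses with a weight. The function names, the `distill_weight` parameter, and the choice of KL divergence are all assumptions; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's chenone logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def co_distillation_loss(logits_a, logits_b):
    """Mean symmetric KL over frames between two encoders' chenone posteriors.

    logits_a, logits_b: lists of frames, each frame a list of chenone logits.
    Assumption: in a real training loop each encoder would treat the other's
    posterior as a fixed target (no gradient through the teacher side).
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both encoders must emit the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(logits_a, logits_b):
        pa, pb = softmax(frame_a), softmax(frame_b)
        total += kl_divergence(pa, pb) + kl_divergence(pb, pa)
    return total / len(logits_a)


def total_loss(transducer_losses, logits_a, logits_b, distill_weight=0.5):
    """Sum of per-encoder transducer losses plus the weighted auxiliary term.

    transducer_losses: precomputed scalar losses, one per encoder, assumed to
    come from a transducer whose predictor and joiner are shared.
    distill_weight: hypothetical weighting hyperparameter.
    """
    return sum(transducer_losses) + distill_weight * co_distillation_loss(
        logits_a, logits_b
    )


# Two frames, three chenone classes, from a small and a large encoder.
small = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
large = [[2.5, 0.0, -1.5], [0.0, 0.4, 0.2]]
print(total_loss([1.2, 0.9], small, large))
```

When the two encoders agree exactly on every frame, the auxiliary term vanishes and the total reduces to the sum of the transducer losses, so the distillation term only penalizes disagreement between the encoders.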

