CL SD AS | Sep 20, 2024

Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

arXiv:2409.13499v2 | 23 citations | h-index: 31
AI Analysis

This paper addresses the problem of limited supervised data for ASR training and offers a more accessible and efficient approach. The contribution is incremental, since it builds on existing knowledge distillation and transducer methods.

The paper tackles training streaming Transformer-Transducer ASR models without supervised data by using pseudo-labels from foundational speech models. This enables single-stage training from scratch on consumer GPUs and reduces data and compute requirements compared with the two-step pre-training and fine-tuning approach.

The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch, in their entirety, on accessible consumer GPUs with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model in just one stage and does not require the large data and computational budget of the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) TT overall performance as a function of the FSM size. Our results demonstrate that TT can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
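The abstract mentions heuristics for filtering hallucinated pseudo-labels but does not spell them out here. The following is a minimal sketch of what such a filter could look like, not the authors' method. The three checks (speaking rate, text compressibility, repeated n-grams), every function name, and every threshold are assumptions chosen for illustration.

```python
import zlib


def words_per_second(text: str, duration_s: float) -> float:
    """Speaking rate implied by the pseudo-label for a clip of given length."""
    if duration_s <= 0:
        return float("inf")
    return len(text.split()) / duration_s


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; highly repetitive text scores high."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def max_ngram_repeats(text: str, n: int = 3) -> int:
    """Count of the most frequent word n-gram in the pseudo-label."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return max(counts.values(), default=0)


def keep_pseudo_label(
    text: str,
    duration_s: float,
    max_wps: float = 5.0,      # assumed threshold, not from the paper
    max_ratio: float = 2.4,    # assumed threshold, not from the paper
    max_repeats: int = 3,      # assumed threshold, not from the paper
) -> bool:
    """Return True if the pseudo-label passes all illustrative sanity checks."""
    if not text.strip():
        return False
    if words_per_second(text, duration_s) > max_wps:
        return False
    if compression_ratio(text) > max_ratio:
        return False
    if max_ngram_repeats(text) > max_repeats:
        return False
    return True


if __name__ == "__main__":
    print(keep_pseudo_label("turn left at the next traffic light", 3.0))
    print(keep_pseudo_label("thank you " * 30, 4.0))
```

In a pipeline of this kind, a filter like `keep_pseudo_label` would run over each (audio duration, FSM transcript) pair before transducer training, and the thresholds would need tuning per language and per dataset.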

