Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
This work addresses the problem of limited labeled data and computational resources faced by small organizations adapting ASR models to specific domains, though the contribution is incremental, as it builds on existing pseudo-labeling and filtering techniques.
The paper tackles efficient domain adaptation for ASR by proposing a multi-stage filtering method using pseudo-labels from Whisper and Zipformer models, reducing the required training data from 7500 hours to 100 hours while keeping WER close to the 12.3% obtained with the full pseudo-labeled set on call center data.
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
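As a rough illustration of how a multi-stage filter over pseudo-labels might be composed, the sketch below combines the three signals named in the abstract: CER between the Whisper and Zipformer hypotheses, a predicted WER, and the presence of named entities. It is not the authors' pipeline. The `Segment` fields, the thresholds (`max_cer`, `max_pred_wer`), the order of the stages, and the greedy hour-budget selection are all assumptions made for illustration; `predicted_wer` and `has_entity` stand in for the outputs of a WER-prediction model and an NER tagger that are not specified here.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp, with ref treated as the reference."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


@dataclass
class Segment:
    seg_id: str
    duration_s: float
    whisper_hyp: str      # pseudo-label from the encoder-decoder model
    zipformer_hyp: str    # pseudo-label from the transducer model
    predicted_wer: float  # hypothetical: output of a WER-prediction model
    has_entity: bool      # hypothetical: True if an NER tagger found an entity


def select_segments(segments, max_cer=0.1, max_pred_wer=0.2, budget_hours=100.0):
    """Filter pseudo-labeled segments, then fill an hour budget greedily."""
    # Stages 1 and 2 (assumed): keep segments where the two systems agree
    # closely and the predicted WER is low.
    kept = [
        s for s in segments
        if cer(s.whisper_hyp, s.zipformer_hyp) <= max_cer
        and s.predicted_wer <= max_pred_wer
    ]
    # Stage 3 (assumed): prefer entity-bearing segments, then lower predicted WER.
    kept.sort(key=lambda s: (not s.has_entity, s.predicted_wer))

    selected, total_s = [], 0.0
    budget_s = budget_hours * 3600
    for s in kept:
        if total_s + s.duration_s > budget_s:
            continue  # skip segments that would exceed the budget
        selected.append(s)
        total_s += s.duration_s
    return selected
```

Treating one system's hypothesis as the reference for CER is itself a design choice; a symmetric variant, or agreement across three systems as in the CER-based baseline the abstract mentions, would follow the same pattern.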