SD · Mar 6

Which Data Matter? Embedding-Based Data Selection for Speech Recognition

arXiv:2603.05819v1 · h-index: 39
Predicted impact: top 4% in SD · last 90 days · Originality: Highly original
AI Analysis

This work provides a method for improving the performance of specialist ASR models by efficiently selecting relevant training data, which is significant for developers building domain-specific speech recognition systems.

This paper addresses the challenge of training specialist ASR models on large, heterogeneous datasets by proposing a targeted data selection strategy. By selecting a relevant 5% subset of 100k hours of in-the-wild training data using embeddings, the authors achieved up to a 36.8% relative WER reduction compared to models trained on the full dataset.
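The two figures quoted above can be made concrete with a little arithmetic. The sketch below computes the size of a 5% subset of 100k hours and shows how a relative WER reduction is calculated. The baseline and improved WER values are hypothetical numbers chosen only to illustrate a 36.8% relative reduction; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: the fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Size of the selected subset: 5% of 100k hours.
total_hours = 100_000
subset_hours = 0.05 * total_hours  # about 5,000 hours

# Hypothetical WERs (illustration only, not taken from the paper):
# a baseline of 10.0% WER falling to 6.32% WER is a relative reduction of about 36.8%.
reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=6.32)
print(f"subset: {subset_hours:.0f} h, relative WER reduction: {reduction:.1%}")
```

Note that a relative reduction is measured against the baseline error rate, so the same 36.8% corresponds to different absolute gains depending on how high the baseline WER is.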

Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics (speaker attributes, phonetic content, and semantic meaning) and analyze how relevance and diversity along these axes during data selection affect downstream ASR performance. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can outperform models trained on the full dataset, with up to 36.8% relative WER reduction on target domains.
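The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible relevance-based variant: rank each pool utterance by the cosine similarity between its embedding and the centroid of the target-domain embeddings, then keep the top 5%. The function name `select_top_fraction`, the centroid-based scoring, and the random stand-in embeddings are all assumptions made for illustration. The paper's actual criterion, its embedding models, and its treatment of diversity may differ.

```python
import numpy as np


def select_top_fraction(pool_emb: np.ndarray,
                        target_emb: np.ndarray,
                        fraction: float = 0.05) -> np.ndarray:
    """Return indices of the pool samples most similar to the target centroid.

    Hypothetical helper: relevance only, scored by cosine similarity.
    Diversity among the selected samples is not modeled here.
    """
    # L2-normalize so that dot products equal cosine similarities.
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    centroid = target_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    scores = pool @ centroid
    k = max(1, int(round(fraction * len(pool))))
    # Sort by descending similarity and keep the k best.
    return np.argsort(-scores)[:k]


# Random stand-ins for real speaker, phonetic, or semantic embeddings.
rng = np.random.default_rng(0)
pool_emb = rng.normal(size=(1000, 16))             # in-the-wild training pool
target_emb = rng.normal(loc=0.5, size=(50, 16))    # small target-domain sample
selected = select_top_fraction(pool_emb, target_emb)
print(len(selected), "of", len(pool_emb), "pool samples selected")
```

In practice one such ranking could be computed per embedding type (speaker, phonetic, semantic) and the results compared or combined, which is the kind of axis-by-axis analysis the abstract describes.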

