LG · SD · AS · ML · Apr 2, 2019

Lessons from Building Acoustic Models with a Million Hours of Speech

arXiv:1904.01624v1 · 90 citations
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of labeled data in speech recognition. Its contribution is incremental, chiefly scaling existing methods.

The authors build acoustic models from 7,000 hours of labeled speech plus 1 million hours of unlabeled speech, using student/teacher training on the unlabeled portion. With little hyper-parameter tuning, they report relative word error rate (WER) improvements of 10 to 20%, with larger gains in noisier conditions.
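Because the reported gains are relative rather than absolute, a small sketch of the arithmetic may help. The function name and the WER values below are purely illustrative assumptions; they are not numbers from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER is lower than the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a baseline at 10.0% WER and a new model at 8.5% WER.
# The absolute gain is 1.5 points; the relative gain is about 15%.
rel = relative_wer_improvement(10.0, 8.5)
print(f"relative improvement: {rel:.1%}")
```

A 10 to 20% relative improvement therefore means a much smaller change in absolute WER points when the baseline is already low.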

This is a report of our lessons learned building acoustic models from 1 million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, which helps scale out target generation in comparison to confidence-model-based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store high-valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF (blockwise model-update filtering) with 64 GPUs, while performing sequence training only on labeled data with gradient threshold compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, we obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.
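The idea of storing only high-valued teacher logits can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the number of logits kept, how the stored logits become targets, or the exact student loss. Here `k`, the renormalization over the stored subset, and the cross-entropy loss are all assumptions, and every function name is hypothetical.

```python
import math


def top_k_logits(logits, k):
    """Keep the k largest teacher logits as (class_index, value) pairs."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, logits[i]) for i in ranked[:k]]


def soft_targets(stored):
    """Softmax over the stored logits only (assumption: dropped classes get 0)."""
    peak = max(v for _, v in stored)
    exps = [(i, math.exp(v - peak)) for i, v in stored]
    total = sum(e for _, e in exps)
    return {i: e / total for i, e in exps}


def student_loss(student_logits, targets):
    """Cross-entropy between sparse teacher targets and the student's softmax."""
    peak = max(student_logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in student_logits))
    return -sum(p * (student_logits[i] - log_z) for i, p in targets.items())


# Toy frame with 6 output classes (illustrative values only).
teacher_logits = [2.0, -1.0, 0.5, 4.0, -3.0, 1.0]
stored = top_k_logits(teacher_logits, k=3)   # what would be written to storage
targets = soft_targets(stored)               # rebuilt at student training time
loss = student_loss([0.0] * 6, targets)      # an untrained, uniform student
```

Storing `k` index/value pairs per frame instead of the full output vector is what would save storage, and because targets are written ahead of time, teacher inference can run in parallel with, and independently of, student training.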

