CL, SD, AS · Oct 22, 2019

Improving Transformer-based Speech Recognition Using Unsupervised Pre-training

arXiv:1910.09932v3 · 105 citations
Originality: Incremental advance
AI Analysis

This work addresses the high cost of data collection for speech recognition systems, offering an incremental improvement in performance for industrial applications.

The paper tackles the cost of transcribed speech data by proposing Masked Predictive Coding, an unsupervised pre-training method for Transformer-based models. On HKUST, using the same training data, it reaches a CER of 23.3%, exceeding the best end-to-end model by over 0.2% absolute CER. With more pre-training data, the CER falls to 21.0%, an 11.8% relative reduction over the baseline.
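The relative-reduction figure can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the baseline CER is not stated in this summary, so it is back-computed from the two reported numbers (21.0% CER and 11.8% relative reduction) rather than taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` versus `baseline`, as a fraction."""
    return (baseline - improved) / baseline


def implied_baseline(improved: float, rel_reduction: float) -> float:
    """Baseline error rate implied by an improved rate and a relative reduction."""
    return improved / (1.0 - rel_reduction)


# Reported figures: CER of 21.0% after an 11.8% relative reduction.
# The baseline below is an inference from those two numbers, not a quoted value.
baseline_cer = implied_baseline(21.0, 0.118)
print(round(baseline_cer, 1))  # 23.8

# Sanity check: plugging the implied baseline back in recovers roughly 11.8%.
print(round(relative_reduction(baseline_cer, 21.0), 3))  # 0.118
```

Under this reading, the 11.8% figure is measured against a baseline of roughly 23.8% CER, which is a different reference point from the 23.3% result obtained with the same training data.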

Speech recognition technologies are gaining enormous popularity in various industrial applications. However, building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, an unsupervised pre-training method called Masked Predictive Coding is proposed, which can be applied to unsupervised pre-training of Transformer-based models. Experiments on HKUST show that, using the same training data, we can achieve a CER of 23.3%, exceeding the best end-to-end model by over 0.2% absolute CER. With more pre-training data, we can further reduce the CER to 21.0%, an 11.8% relative CER reduction over the baseline.
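The abstract does not spell out how Masked Predictive Coding works, so the following is only a rough sketch of the general idea the name suggests: hide some input frames and train the model to reconstruct them from context. The 15% masking rate, zero-filling of masked frames, the L1 reconstruction loss, and the function names `mask_frames` and `masked_l1_loss` are all assumptions made for illustration (borrowed from BERT-style masking), not details confirmed by this page.

```python
import random


def mask_frames(features, mask_prob=0.15, seed=0):
    """Return a copy of `features` with random frames zeroed, plus their indices.

    `features` is a list of frames, each a list of floats.
    mask_prob=0.15 is an assumed value, not taken from the abstract.
    """
    rng = random.Random(seed)
    masked = [list(frame) for frame in features]
    masked_idx = [t for t in range(len(features)) if rng.random() < mask_prob]
    if not masked_idx:  # guarantee at least one masked frame
        masked_idx = [rng.randrange(len(features))]
    for t in masked_idx:
        masked[t] = [0.0] * len(features[t])
    return masked, masked_idx


def masked_l1_loss(predicted, target, masked_idx):
    """Mean absolute error computed only over the masked frames (assumed loss)."""
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(predicted[t], target[t]):
            total += abs(p - y)
            count += 1
    return total / count


# Toy input: 20 frames of 4-dimensional features.
features = [[float(t + 1)] * 4 for t in range(20)]
masked, idx = mask_frames(features)

# Stand-in for a Transformer encoder: an identity "model" that returns its
# (masked) input unchanged, so the loss is just the size of what was hidden.
loss = masked_l1_loss(masked, features, idx)
```

In a real setup the identity stand-in would be replaced by a Transformer encoder trained to minimise this loss on untranscribed audio, and the pre-trained weights would then be fine-tuned on the transcribed data.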

Code Implementations: 1 repo