CL SD AS
Oct 22, 2019

Improving Transformer-based Speech Recognition Using Unsupervised Pre-training

Dongwei Jiang, Xiaoning Lei, Wubo Li, Ne Luo, Yuxuan Hu, Wei Zou, Xiangang Li

arXiv:1910.09932v37.3105 citations

Originality: Incremental advance

AI Analysis

This work addresses the high cost of data collection for speech recognition systems, offering an incremental improvement in performance for industrial applications.

The paper addresses the cost of transcribed data for speech recognition by proposing Masked Predictive Coding, an unsupervised pre-training method for Transformer-based models. Using the same training data, it reaches a CER of 23.3% on HKUST, over 0.2% absolute better than the best end-to-end model. With more pre-training data, CER falls to 21.0%, an 11.8% relative reduction over the baseline.

Speech recognition technologies are gaining enormous popularity in various industrial applications. However, building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, an unsupervised pre-training method called Masked Predictive Coding is proposed, which can be applied to unsupervised pre-training of Transformer-based models. Experiments on HKUST show that, using the same training data, we can achieve a CER of 23.3%, exceeding the best end-to-end model by over 0.2% absolute CER. With more pre-training data, we can further reduce the CER to 21.0%, an 11.8% relative CER reduction over the baseline.
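The abstract mixes absolute and relative CER figures, so a short sketch of the arithmetic may help. The function names below are hypothetical, and the baseline CER is not stated in the abstract; it is only back-computed here from the quoted 21.0% CER and 11.8% relative reduction, so treat it as an approximation subject to rounding.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction, expressed as a fraction of the baseline CER."""
    return (baseline_cer - new_cer) / baseline_cer


def implied_baseline(new_cer: float, relative_reduction_frac: float) -> float:
    """Baseline CER implied by a final CER and a relative reduction (assumed exact)."""
    return new_cer / (1.0 - relative_reduction_frac)


# Figures quoted in the abstract: 21.0% CER, 11.8% relative reduction.
# The baseline itself is NOT given in the abstract; this is a derived estimate.
baseline = implied_baseline(21.0, 0.118)
print(f"implied baseline CER: {baseline:.1f}%")
print(f"relative reduction check: {relative_reduction(baseline, 21.0):.3f}")
```

Under these assumptions the implied baseline is roughly 23.8% CER, so the 11.8% relative reduction corresponds to about 2.8 points of absolute CER.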
