AS CL LG SD · Dec 3, 2019

Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff

arXiv:1912.01679v2 26.7147 citations

Originality: Highly original

AI Analysis

This work addresses the high cost of labeled data for speech recognition and offers a practical option for resource-constrained applications, although it is incremental in that it builds on existing semi-supervised and representation learning methods.

The paper aims to reduce the amount of labeled data needed for automatic speech recognition (ASR) by proposing a semi-supervised approach based on deep contextualized acoustic representations (DeCoAR). It reports 42% and 19% relative improvements over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively, and with only 100 hours of labeled data it performs on par with a system trained on all 960 hours.
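To make "relative improvement" concrete, here is a minimal sketch that assumes the metric is an error rate such as word error rate, where lower is better. The function name and the numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline_err: float, new_err: float) -> float:
    """Fraction by which the error rate drops relative to the baseline.

    Assumes a lower-is-better metric such as word error rate.
    """
    if baseline_err <= 0:
        raise ValueError("baseline_err must be positive")
    return (baseline_err - new_err) / baseline_err


# Made-up error rates, NOT figures from the paper:
# a drop from 10.0 to 5.8 is a 42% relative improvement.
example = relative_improvement(10.0, 5.8)
print(f"{example:.0%}")
```

A relative figure says nothing about the absolute error rates, so the same 42% can correspond to very different baselines.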

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online.
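The reconstruction objective described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the slice length, the L1 loss, and the linear-interpolation `toy_predictor` (standing in for a learned network) are all hypothetical choices made here to keep the example self-contained.

```python
from typing import List, Tuple

Frame = List[float]  # one frame of filterbank features


def split_context(
    frames: List[Frame], start: int, slice_len: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split a sequence into (past, target slice, future) around `start`."""
    end = start + slice_len
    if start < 1 or end >= len(frames):
        raise ValueError("need at least one context frame on each side")
    return frames[:start], frames[start:end], frames[end:]


def toy_predictor(
    past: List[Frame], future: List[Frame], slice_len: int
) -> List[Frame]:
    """Placeholder for a learned model (assumption): interpolate linearly
    between the last past frame and the first future frame."""
    left, right = past[-1], future[0]
    preds = []
    for k in range(1, slice_len + 1):
        w = k / (slice_len + 1)
        preds.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return preds


def l1_reconstruction_loss(pred: List[Frame], target: List[Frame]) -> float:
    """Sum of absolute differences over every frame and feature dimension."""
    return sum(
        abs(p - t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    )


# Toy 2-dimensional "filterbank" sequence that varies linearly over time.
frames = [[float(i), float(2 * i)] for i in range(8)]
past, target, future = split_context(frames, start=3, slice_len=2)
pred = toy_predictor(past, future, slice_len=2)
loss = l1_reconstruction_loss(pred, target)
print(f"reconstruction loss: {loss:.6f}")
```

In a real system the predictor would be a trained neural network that sees the full past and future context, and its internal representations, rather than the reconstruction itself, would be passed to the downstream ASR model.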
