CL · May 26, 2017

Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

Shane Walker, Morten Pedersen, Iroro Orife, Jason Flaks

arXiv:1705.09724v12.712 citations

Originality: Incremental advance

AI Analysis

This work addresses the high cost and dated nature of labeled conversational speech data, which limit speech recognition accuracy in real-world applications. The contribution is incremental, as it builds on existing semi-supervised methods.

The authors address the problem of training conversational speech recognition models with limited labeled data by constructing a large-scale, modern training corpus from unlabeled telephony data. They report relative WER reductions of 35% and 19% on agent and caller utterances, respectively, over their seed model, and a 5% absolute WER improvement over IBM Watson STT.

For conversational large-vocabulary continuous speech recognition (LVCSR) tasks, up to about two thousand hours of audio is commonly used to train state-of-the-art models. Collection of labeled conversational audio, however, is prohibitively expensive, laborious and error-prone. Furthermore, academic corpora like Fisher English (2004) or Switchboard (1992) are inadequate to train models with sufficient accuracy in the unbounded space of conversational speech. These corpora are also timeworn due to dated acoustic telephony features and the rapid advancement of colloquial vocabulary and idiomatic speech over the last decades. Utilizing the colossal scale of our unlabeled telephony dataset, we propose a technique to construct a modern, high-quality conversational speech training corpus on the order of hundreds of millions of utterances (or tens of thousands of hours) for both acoustic and language model training. We describe the data collection, selection and training, evaluating the results of our updated speech recognition system on a test corpus of 7K manually transcribed utterances. We show relative word error rate (WER) reductions of {35%, 19%} on {agent, caller} utterances over our seed model and 5% absolute WER improvements over IBM Watson STT on this conversational speech task.
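The abstract reports both relative and absolute WER changes, which are easy to conflate. The sketch below shows one common way to compute WER (word-level edit distance divided by reference length) and how the two kinds of reduction differ. The function names and all numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Plain difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer
```

With hypothetical numbers, a baseline WER of 0.20 that drops to 0.13 is a 35% relative reduction but only a 7-point absolute one, so a "5% absolute" figure and a "35% relative" figure are not directly comparable.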
