AS · LG · SD · ML · Jan 31, 2020

Training Keyword Spotters with Limited and Synthesized Speech Data

arXiv:2002.01322v1 · 63 citations
AI Analysis

This paper addresses data scarcity for developers of low-power speech-enabled devices and offers a practical way to produce keyword spotters quickly. The contribution is incremental, since it builds on existing embedding methods.

The paper tackles the problem of training keyword spotting models with limited real speech data by combining synthesized speech with a pre-trained speech embedding. A model that uses the embedding and is trained only on synthetic speech matches the accuracy of a model trained on over 500 real examples. Without the embedding, a model would need over 4000 real examples to reach the same accuracy.

With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low-level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.
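
The setup described in the abstract (a frozen, pre-trained embedding followed by a small keyword head trained only on synthesized examples and evaluated on real ones) can be illustrated with a toy sketch. This is not the authors' implementation: the `embed` function is a hypothetical stand-in (a fixed random projection), the "synthetic" and "real" sets are random toy vectors rather than audio, and all names and sizes below are assumptions chosen for illustration.

```python
import math
import random

NUM_KEYWORDS = 3   # the paper detects 10 keywords; 3 keeps the toy small
FEATURE_DIM = 8    # hypothetical size of a low-level feature vector
EMBED_DIM = 4      # hypothetical embedding size

rng = random.Random(0)

# ASSUMPTION: stand-in for the frozen, pre-trained speech embedding.
# Here it is only a fixed random projection followed by tanh.
EMBED_W = [[rng.gauss(0, 0.5) for _ in range(FEATURE_DIM)]
           for _ in range(EMBED_DIM)]


def embed(features):
    """Map a feature vector to a fixed embedding (never updated in training)."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in EMBED_W]


def make_examples(prototypes, n_per_class, noise):
    """Toy data: each keyword is a prototype vector plus Gaussian noise."""
    return [([p + rng.gauss(0, noise) for p in proto], label)
            for label, proto in enumerate(prototypes)
            for _ in range(n_per_class)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(head, embedding):
    weights, biases = head
    return [sum(w * v for w, v in zip(row, embedding)) + b
            for row, b in zip(weights, biases)]


def train_head(examples, epochs=50, lr=0.5):
    """Train a linear softmax head on the frozen embeddings with plain SGD."""
    weights = [[0.0] * EMBED_DIM for _ in range(NUM_KEYWORDS)]
    biases = [0.0] * NUM_KEYWORDS
    order = list(examples)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(order)
        for features, label in order:
            e = embed(features)
            probs = softmax(head_logits((weights, biases), e))
            for k in range(NUM_KEYWORDS):
                # Gradient of cross-entropy w.r.t. logit k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                biases[k] -= lr * grad
                for j in range(EMBED_DIM):
                    weights[k][j] -= lr * grad * e[j]
    return weights, biases


def predict(head, features):
    logits = head_logits(head, embed(features))
    return max(range(NUM_KEYWORDS), key=lambda k: logits[k])


def accuracy(head, examples):
    correct = sum(1 for features, label in examples
                  if predict(head, features) == label)
    return correct / len(examples)


prototypes = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
              for _ in range(NUM_KEYWORDS)]
# Low-noise set stands in for synthesized speech, noisier set for real speech.
synthetic_examples = make_examples(prototypes, 20, noise=0.2)
real_examples = make_examples(prototypes, 20, noise=0.4)

head = train_head(synthetic_examples)      # train on "synthetic" data only
real_accuracy = accuracy(head, real_examples)  # evaluate on "real" data
print(f"toy accuracy on the 'real' set: {real_accuracy:.2f}")
```

Only the head's weights are updated, which is why such a model can stay small. The numbers this toy prints say nothing about the paper's results; it shows the shape of the training and evaluation split only.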

Code Implementations: 1 repo