LG, SD, AS · Nov 3, 2020

Training Wake Word Detection with Synthesized Speech Data on Confusion Words

arXiv:2011.01460v1 · 3 citations
AI Analysis

This work addresses robustness issues in keyword spotting systems for real-life applications, but it is incremental as it builds on existing data augmentation techniques.

The paper tackled the problem of confusing words degrading wake word detection performance by investigating two data augmentation methods: synthesized speech from a multi-speaker TTS system and random noise addition to acoustic features. The results showed that augmentations improved robustness, with synthetic data leading to significant gains in confusing word scenarios.

Confusing words are commonly encountered in real-life keyword spotting (KWS) applications and cause severe performance degradation, owing to complex spoken terms and the many words that sound similar to the predefined keywords. To make the wake word detection system more robust in such scenarios, we investigate two data augmentation setups for training end-to-end KWS systems. The first adds synthesized data from a multi-speaker speech synthesis system; the second adds random noise to the acoustic features. Experimental results show that the augmentations help improve the system's robustness. Moreover, by augmenting the training set with synthetic data generated by the multi-speaker text-to-speech system, we achieve a significant improvement in the confusing-words scenario.
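The second augmentation, random noise added to the acoustic features, can be illustrated with a minimal sketch. The abstract does not state the noise distribution, its scale, or the feature type, so everything specific here is an assumption made for illustration: zero-mean Gaussian noise, the hypothetical function name `add_random_noise`, the hypothetical parameter `noise_std`, and features stored as a list of frames, each a list of floats.

```python
import random
from typing import List, Optional

Features = List[List[float]]  # frames x feature dimensions (assumed layout)


def add_random_noise(
    features: Features,
    noise_std: float = 0.1,       # hypothetical scale, not taken from the paper
    seed: Optional[int] = None,   # fix the seed for reproducible augmentation
) -> Features:
    """Return a noisy copy of `features`; the input is left unmodified.

    Assumption: independent zero-mean Gaussian noise is added to every
    feature value. The paper may use a different noise model.
    """
    rng = random.Random(seed)
    return [
        [value + rng.gauss(0.0, noise_std) for value in frame]
        for frame in features
    ]


# Usage: augment one utterance of 3 frames with 4 features each.
utterance = [[0.0, 1.0, 2.0, 3.0] for _ in range(3)]
augmented = add_random_noise(utterance, noise_std=0.05, seed=0)
```

In a training loop, a function like this would typically be applied on the fly with a fresh random state per batch, so that the model sees a different perturbation of each utterance in every epoch.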
