AS LG SD SP ML

Sep 9, 2020

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein

arXiv:2009.04323v1

18.699 citations, h-index: 31

Originality: Incremental advance

AI Analysis

This work aims to improve speech recognition accuracy in noisy, multi-speaker environments for on-device applications. It is an incremental advance that builds on existing source separation methods.

The paper tackles the problem of separating a target speaker's voice from overlapped speech in on-device streaming speech recognition, achieving a 15.6% relative reduction in word error rate on overlapped speech without degrading performance in other conditions.

We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
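The abstract names two techniques without giving formulas: an asymmetric loss and an adaptive runtime suppression strength. The following is a minimal sketch of how such ideas could look, not the authors' implementation. It assumes the asymmetry means penalizing over-suppression (enhanced features falling below the clean target) more heavily than under-suppression, and that suppression strength acts as a blend weight between the unprocessed input and the model output. The function names, the `alpha` default, and the linear blend are all hypothetical choices made for illustration.

```python
import numpy as np


def asymmetric_l2_loss(clean, enhanced, alpha=10.0):
    """Squared-error loss that weights over-suppression more heavily.

    Assumption: diff > 0 means the enhanced output is below the clean
    target (over-suppression); those errors are scaled by alpha > 1
    before squaring. alpha=10.0 is an illustrative value only.
    """
    diff = np.asarray(clean, dtype=float) - np.asarray(enhanced, dtype=float)
    weighted = np.where(diff > 0, alpha * diff, diff)
    return float(np.sum(weighted ** 2))


def blend_with_input(noisy, enhanced, weight):
    """Interpolate between the unprocessed input and the enhanced output.

    Assumption: weight is a suppression strength in [0, 1], where 0
    bypasses the model entirely and 1 uses its output unchanged.
    """
    w = float(np.clip(weight, 0.0, 1.0))
    noisy = np.asarray(noisy, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    return w * enhanced + (1.0 - w) * noisy
```

Under these assumptions, a low blend weight in conditions without overlapped speech would leave the recognizer's input nearly untouched, which is one way to read the abstract's requirement that the model must not hurt recognition under other acoustic conditions.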
