AS, LG, SD, SP, ML · Sep 9, 2020

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

arXiv:2009.04323v1 · 99 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving on-device speech recognition accuracy in noisy, multi-speaker environments. It is an incremental advance that builds on existing source separation methods.

The paper tackles the problem of separating a target speaker's voice from overlapped speech in on-device streaming speech recognition, achieving a 15.6% relative reduction in word error rate on overlapped speech without degrading performance in other conditions.

We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: it should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
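The abstract names an asymmetric loss but does not give its form here. As a rough illustration only, the sketch below assumes one plausible formulation: an L2 loss on the difference between clean and enhanced features, where errors caused by over-suppression (enhanced output below the clean target) are scaled by a factor `alpha` greater than 1. The function name `asymmetric_l2_loss`, the argument names, and the default `alpha=10.0` are all hypothetical and are not taken from the text above.

```python
import numpy as np


def asymmetric_l2_loss(clean, enhanced, alpha=10.0):
    """Assumed asymmetric L2 loss over spectrogram-like arrays.

    Positive differences (clean > enhanced) mean the model removed too
    much energy, so they are scaled by alpha before squaring. Negative
    differences (under-suppression) are left unscaled.
    """
    clean = np.asarray(clean, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    diff = clean - enhanced
    # Hypothetical weighting: penalize over-suppression more heavily.
    weighted = np.where(diff > 0, alpha * diff, diff)
    return float(np.sum(weighted ** 2))


if __name__ == "__main__":
    target = np.array([1.0, 1.0])
    over_suppressed = np.array([0.5, 0.5])   # too much energy removed
    under_suppressed = np.array([1.5, 1.5])  # interference left in
    print(asymmetric_l2_loss(target, over_suppressed))
    print(asymmetric_l2_loss(target, under_suppressed))
```

With `alpha=1.0` the function reduces to an ordinary sum-of-squares loss. A larger `alpha` biases training toward leaving some interference in rather than deleting target speech, which is consistent with the stated requirement that the model must not hurt recognition under other acoustic conditions.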

Code Implementations: 2 repos