AS · LG · SD · Jul 2, 2021

Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

arXiv:2107.01201v2 · 12 citations
AI Analysis

This work addresses multi-user speech processing for applications such as ASR and speaker verification. It is incremental in that it builds on existing models such as VoiceFilter-Lite.

The paper tackles the problem of enabling speaker-conditioned speech models to handle multiple enrolled users simultaneously. With up to four enrolled users, it reports significant reductions in speech recognition and speaker verification errors in overlapping-speech scenarios.

In this paper, we propose a solution to allow speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it for three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models such as personal VAD and personalized ASR.
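
As a rough illustration of the idea in the abstract, the sketch below collapses several enrolled speaker embeddings into one attentive embedding by taking a softmax-weighted sum. It is not the paper's implementation: the scaled dot-product scoring, the `query` vector (standing in for whatever the model derives from the current input frame), and the function names are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_embedding(
    query: Sequence[float],
    speaker_embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Combine N enrolled speaker embeddings into a single embedding.

    ASSUMPTION: scores are scaled dot products between a hypothetical
    per-frame `query` and each enrolled embedding. The paper only states
    that an attention mechanism is used; this scoring is illustrative.
    """
    if not speaker_embeddings:
        raise ValueError("at least one enrolled speaker embedding is required")
    dim = len(query)
    if any(len(emb) != dim for emb in speaker_embeddings):
        raise ValueError("query and embeddings must share one dimension")

    scale = 1.0 / math.sqrt(dim)
    scores = [
        scale * sum(q * e for q, e in zip(query, emb))
        for emb in speaker_embeddings
    ]
    weights = softmax(scores)

    # Weighted sum of the enrolled embeddings, one output dimension at a time.
    combined = [
        sum(w * emb[i] for w, emb in zip(weights, speaker_embeddings))
        for i in range(dim)
    ]
    return combined, weights


if __name__ == "__main__":
    # Three toy 2-D "enrolled users"; the query points towards the first one.
    enrolled = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    embedding, attention = attentive_embedding([4.0, 0.0], enrolled)
    print("attention weights:", attention)
    print("attentive embedding:", embedding)
```

Because the output has the same dimension as a single speaker embedding, it can be passed as the side input wherever a single-user model expects one embedding, regardless of how many users are enrolled. With one enrolled user the weights reduce to `[1.0]` and the function returns that user's embedding unchanged.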
