AS LG SP ML
Oct 11, 2018

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

arXiv:1810.04826v6 33.5420 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving speech recognition accuracy in noisy, multi-speaker environments. It is rated incremental because it builds on existing speaker separation techniques.

The paper tackles the problem of separating a target speaker's voice from multi-speaker signals using a reference signal, achieving a significant reduction in speech recognition word error rate (WER) for multi-speaker scenarios with minimal degradation in single-speaker cases.

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
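The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `toy_masking_network`, the single linear layer with a sigmoid, the random weights, the tiny dimensions, and the choice to concatenate the speaker embedding onto every frame are all assumptions made for illustration. Only the overall data flow (noisy spectrogram plus speaker embedding in, mask out, mask applied to the spectrogram) follows the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_masking_network(noisy_spec, speaker_emb, weights, bias):
    """Hypothetical stand-in for the masking network: one linear layer + sigmoid.

    noisy_spec: list of frames, each a list of non-negative magnitudes.
    speaker_emb: list of floats, reused unchanged for every frame.
    Returns a mask with the same shape as noisy_spec, values in (0, 1).
    """
    mask = []
    for frame in noisy_spec:
        # Assumed conditioning scheme: append the embedding to each frame.
        features = frame + speaker_emb
        row = [
            sigmoid(sum(f * w for f, w in zip(features, weights[k])) + bias[k])
            for k in range(len(frame))
        ]
        mask.append(row)
    return mask


random.seed(0)
num_frames, num_bins, emb_dim = 4, 8, 3  # toy sizes, not taken from the paper

# Random placeholders for a magnitude spectrogram and a speaker embedding.
noisy_spec = [[abs(random.gauss(0, 1)) for _ in range(num_bins)]
              for _ in range(num_frames)]
speaker_emb = [random.gauss(0, 1) for _ in range(emb_dim)]

# Untrained random weights; a real system would learn these.
weights = [[random.gauss(0, 0.1) for _ in range(num_bins + emb_dim)]
           for _ in range(num_bins)]
bias = [0.0] * num_bins

mask = toy_masking_network(noisy_spec, speaker_emb, weights, bias)

# Apply the mask element-wise to obtain the enhanced spectrogram.
enhanced_spec = [[m * x for m, x in zip(mask_row, spec_row)]
                 for mask_row, spec_row in zip(mask, noisy_spec)]
```

Because every mask value lies between 0 and 1, the enhanced spectrogram can only attenuate time-frequency bins, never amplify them. In a trained system the mask would attenuate the bins dominated by interfering speakers.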
