AS LG SP ML | Oct 11, 2018

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

arXiv:1810.04826v6 | 420 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving speech recognition accuracy in noisy, multi-speaker environments. The advance is incremental because it builds on existing speaker separation techniques.

The paper tackles the problem of separating a target speaker's voice from multi-speaker signals using a reference signal from that speaker. It achieves a significant reduction in speech recognition word error rate (WER) in multi-speaker scenarios, with minimal degradation in single-speaker cases.
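For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` is hypothetical, and nothing here is taken from the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two substituted words out of four reference words.
wer = word_error_rate("turn on the lights", "turn off the light")
```

A lower value is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.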

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
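The speaker-conditioned masking idea in the abstract can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' architecture: a single random linear layer with a sigmoid stands in for the trained masking network, the speaker embedding is a random vector rather than the output of a speaker recognition network, and all names (`predict_mask`, `apply_mask`, the sizes `T`, `F`, `D`) are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_mask(noisy_spec, speaker_emb, weights, bias):
    """Toy stand-in for the masking network: one linear layer plus sigmoid.

    noisy_spec: T frames, each a list of F magnitudes.
    speaker_emb: D floats, the same embedding for every frame.
    weights: F rows of length F + D; bias: F floats.
    """
    mask = []
    for frame in noisy_spec:
        # Condition each frame on the target speaker by concatenation.
        features = frame + speaker_emb
        mask.append([
            sigmoid(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(weights, bias)
        ])
    return mask


def apply_mask(noisy_spec, mask):
    """Element-wise product of the soft mask and the noisy spectrogram."""
    return [[m * s for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, noisy_spec)]


random.seed(0)
T, F, D = 4, 6, 3  # frames, frequency bins, embedding size (all assumed)
noisy = [[random.random() for _ in range(F)] for _ in range(T)]
emb = [random.gauss(0.0, 1.0) for _ in range(D)]
W = [[random.gauss(0.0, 0.1) for _ in range(F + D)] for _ in range(F)]
b = [0.0] * F

mask = predict_mask(noisy, emb, W, b)
enhanced = apply_mask(noisy, mask)
```

Because the mask values lie between 0 and 1, each time-frequency bin of the noisy input can only be attenuated; in a trained system the mask would ideally suppress bins dominated by interfering speakers.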

Code Implementations: 5 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
