AS CL SD

Feb 7, 2021

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

arXiv:2102.03762v1

4.33 citations

Originality: Incremental advance

AI Analysis

This work delivers a measurable gain in speech extraction and recognition accuracy in noisy and reverberant environments, an incremental improvement over existing methods.

This paper introduces a multi-channel speech extraction system that simultaneously extracts multiple clean speech sources from noisy and reverberant mixtures. On 2-channel WHAMR! data, the system achieves a 9% relative improvement in source separation performance and a relative increase of over 16% in speech recognition accuracy compared with a strong multi-channel baseline.

In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network, which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To convey speaker information to the extraction model efficiently, we propose a new speaker conditioning mechanism that adds a dedicated speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline and increases speech recognition accuracy by more than 16% relative over the same baseline.
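The results above are quoted as relative changes against a baseline, not as absolute differences. As a minimal sketch of how such a figure is typically computed, the helper below handles both a higher-is-better metric (such as a separation score) and a lower-is-better metric (such as a word error rate). The function name `relative_gain` and all input values are hypothetical and chosen only for illustration; they are not taken from the paper, and the abstract does not state which underlying metrics were used.

```python
def relative_gain(baseline: float, proposed: float, higher_is_better: bool = True) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    # For a lower-is-better metric (e.g. an error rate), a decrease counts as a gain.
    delta = proposed - baseline if higher_is_better else baseline - proposed
    return 100.0 * delta / abs(baseline)


# Hypothetical numbers, NOT results from the paper.
sep_gain = relative_gain(10.0, 10.9)                          # separation score: about 9 percent
wer_gain = relative_gain(25.0, 21.0, higher_is_better=False)  # error rate: about 16 percent

print(f"separation: {sep_gain:.1f}% relative, recognition: {wer_gain:.1f}% relative")
```

A relative figure depends on the baseline value, so the same absolute difference yields a larger percentage when the baseline is small.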
