ASSD Jan 14, 2021

Speaker activity driven neural speech extraction

arXiv:2101.05516v2
40 citations
AI Analysis

This work addresses speech extraction for applications such as meeting processing. Its contribution is incremental: it adapts an existing target speech extraction approach to use speaker activity as the auxiliary clue instead of pre-recorded enrollment utterances.

The paper tackles the problem of extracting a target speaker's speech from a single-channel mixture by using speaker activity information as an auxiliary clue. It reports performance competitive with enrollment-based methods and improved ASR, with a relative word error rate reduction of up to 25% in high overlapping conditions.
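To make "relative word error rate reduction" concrete, here is a minimal sketch of the usual arithmetic. The function name and the example numbers are hypothetical and are not taken from the paper; only the "up to 25%" figure comes from the source.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT from the paper: a drop from 20.0% to 15.0% WER
# is 5 points absolute, which is a 25% relative reduction.
example = relative_wer_reduction(20.0, 15.0)
```

A relative figure depends on the baseline, so the same 25% can correspond to very different absolute improvements.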

Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in high overlapping conditions, with a relative word error rate reduction of up to 25%.
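As a rough illustration of what a speaker activity clue might look like when it comes from a diarization system, the sketch below turns one speaker's diarization segments into a frame-level binary activity vector. This is an assumption-laden sketch, not ADEnet itself: the function name, the 10 ms frame shift, and the segment format are all hypothetical, and how the network consumes the clue is not shown.

```python
def activity_from_segments(segments, num_frames, frame_shift=0.01):
    """Build a binary frame-level activity vector for one speaker.

    segments: iterable of (start_sec, end_sec) pairs, e.g. diarization output.
    num_frames: total number of frames in the recording.
    frame_shift: frame hop in seconds (0.01 is an assumed value).
    """
    activity = [0] * num_frames
    for start, end in segments:
        # Convert times to frame indices and clip to the recording length.
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for i in range(first, last):
            activity[i] = 1
    return activity


# Hypothetical segments: the target speaker talks during 0.0-0.5 s and 1.2-2.0 s.
target_activity = activity_from_segments([(0.0, 0.5), (1.2, 2.0)], num_frames=300)
```

A vector like `target_activity` marks when the target speaker is active, which is the kind of information the abstract describes using in place of an enrollment utterance.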
