AS | SD | Feb 21, 2022

L-SpEx: Localized Target Speaker Extraction

Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

arXiv:2202.09995v1 | 8.6 | 31 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses target speaker extraction in noisy, multi-talker environments, with applications such as hearing aids and speech recognition. The advance is incremental: it builds on prior work that used location or visual information to guide extraction.

The paper tackles the problem of extracting a target speaker's voice from a multi-talker mixture without relying on visual cues. It proposes L-SpEx, an end-to-end method that uses only speech cues to localize and then extract the target speaker. On the MC-Libri2Mix dataset, L-SpEx significantly outperforms the baseline system.

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., a face image or video. In this paper, we propose an end-to-end localized target speaker extraction method based purely on speech cues, called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including the direction-of-arrival (DOA) of the target speaker and the beamforming output. Then, the spatial cues and the target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.
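
To make the link between a DOA estimate and a "beamforming output" concrete, here is a minimal sketch of a classical delay-and-sum beamformer. It is not the L-SpEx localizer, which is a learned, embedding-driven network. The function names, the linear-array geometry, the far-field assumption, the integer-sample rounding, and the speed-of-sound constant are all assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(mic_positions_m, doa_deg, sample_rate):
    """Integer sample delays for a far-field source arriving from doa_deg.

    Assumes a linear array along one axis, with the angle measured from
    that axis. Delays are shifted so the earliest microphone has delay 0
    and are rounded to whole samples (a simplification).
    """
    cos_theta = math.cos(math.radians(doa_deg))
    # A microphone further along the axis toward the source hears it earlier.
    raw = [-x * cos_theta / SPEED_OF_SOUND * sample_rate for x in mic_positions_m]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay and average the aligned samples."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]
```

Given a DOA (however it was obtained), `steering_delays` yields per-microphone delays and `delay_and_sum` aligns and averages the channels, which reinforces sound from that direction relative to sound from other directions. A practical system would typically use fractional delays or frequency-domain weights instead of integer-sample shifts.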
