AS · MM · SD · IV · Oct 15, 2020

Muse: Multi-modal target speaker extraction with visual cues

arXiv:2010.07775v3 · 72 citations
Originality: Highly original
AI Analysis

This work addresses speaker extraction in noisy environments, with applications such as hearing aids and speech processing, and offers a novel approach that eliminates the need for pre-recorded reference speech.

The paper tackles the problem of extracting a target speaker's voice from a mixture without needing pre-recorded reference speech by using visual lip movement cues, resulting in a method that outperforms baselines in SI-SDR and PESQ metrics and shows consistent cross-dataset improvements.

A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract reference target speech directly from the mixture speech at inference time, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms other competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.
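The abstract reports results in SI-SDR (scale-invariant signal-to-distortion ratio). As a rough illustration of what that metric measures, here is a minimal pure-Python sketch of the standard SI-SDR definition. It is not taken from the paper or its repository. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for this example.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean of each signal (a common convention, assumed here).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target counts as distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


# Toy signals: a sine "reference" plus an orthogonal cosine at 1/10 amplitude.
reference = [math.sin(2 * math.pi * k / 16) for k in range(64)]
interference = [0.1 * math.cos(2 * math.pi * k / 16) for k in range(64)]
estimate = [r + i for r, i in zip(reference, interference)]

print(round(si_sdr(estimate, reference), 2))  # about 20 dB
```

Because the target is obtained by projecting the estimate onto the reference, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant. Higher values indicate a cleaner extraction.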

Code Implementations: 1 repo