CV · SD · Apr 11, 2018

The Conversation: Deep Audio-Visual Speech Enhancement

arXiv:1804.04121v2 · 395 citations
AI Analysis

This paper addresses speaker isolation in noisy, real-world video, with applications such as hearing aids and video conferencing. It presents a new method rather than an incremental improvement.

The paper tackles the problem of isolating individual speakers from simultaneous multi-talker speech in videos. It proposes a deep audio-visual speech enhancement network that uses the lip regions in the corresponding video to predict both the magnitude and the phase of the target signal. It reports strong results on challenging real-world examples, and the method applies to speakers unseen during training and to unconstrained environments.

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
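
The abstract stresses that the network predicts both the magnitude and the phase of the target signal. The toy sketch below is not the paper's model; it only illustrates why phase can matter. It assumes, purely for illustration, that enhancement is applied per complex time-frequency bin as a magnitude mask plus an additive phase correction. The function `enhance_bins`, the toy bins, and the oracle `mask` and `residual` values are all hypothetical; in a real system they would come from a trained network conditioned on the lip-region video.

```python
import cmath


def enhance_bins(noisy_bins, mask, phase_residual):
    """Apply a magnitude mask and a phase correction to complex bins.

    Hypothetical helper: the mask and phase residual are stand-ins for
    network outputs, which this sketch does not model.
    """
    enhanced = []
    for z, m, dphi in zip(noisy_bins, mask, phase_residual):
        magnitude = abs(z) * m             # scale the noisy magnitude
        phase = cmath.phase(z) + dphi      # shift the noisy phase
        enhanced.append(cmath.rect(magnitude, phase))
    return enhanced


def error(estimate, reference):
    """Sum of absolute differences between two lists of complex bins."""
    return sum(abs(e - r) for e, r in zip(estimate, reference))


# Toy data (assumed values): two target bins plus an interfering speaker.
target = [cmath.rect(1.0, 0.3), cmath.rect(0.5, -1.2)]
interference = [cmath.rect(0.8, 2.0), cmath.rect(0.4, 0.9)]
noisy = [t + i for t, i in zip(target, interference)]

# Oracle mask and phase residual, computed from the known target
# for illustration only.
mask = [abs(t) / abs(n) for t, n in zip(target, noisy)]
residual = [cmath.phase(t) - cmath.phase(n) for t, n in zip(target, noisy)]

# Correct magnitude but keep the noisy phase.
magnitude_only = enhance_bins(noisy, mask, [0.0] * len(noisy))
# Correct both magnitude and phase.
magnitude_and_phase = enhance_bins(noisy, mask, residual)

print("error, magnitude only:     ", error(magnitude_only, target))
print("error, magnitude and phase:", error(magnitude_and_phase, target))
```

Even with a perfect magnitude mask, reusing the noisy phase leaves a residual error in this toy setting, while correcting both quantities recovers the target bins up to floating-point precision.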
