CV · SD · Aug 22, 2017

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

arXiv:1708.06767v3 · 35 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker separation and enhancement in noisy environments, which is useful for applications like video conferencing or surveillance, but it is incremental as it builds on existing audio-visual techniques.

The paper tackles the problem of isolating a specific speaker's voice from noisy audio-visual recordings by using face motion from video to estimate speech, then filtering the audio. The method achieves significant improvements in SDR and PESQ scores on GRID and TCD-TIMIT datasets compared to baseline video-to-speech and audio-only approaches.
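As a rough illustration of what an SDR improvement measures, the sketch below computes a simplified signal-to-distortion ratio in decibels as the energy of the clean reference divided by the energy of the residual error. This is an assumption made for illustration: evaluations of this kind commonly use the BSS Eval definition of SDR, which decomposes the error further, and PESQ is a standardized perceptual metric that is not reproduced here. The function name `sdr_db` and the toy signals are hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over residual-error energy.

    Not the full BSS Eval decomposition; a plain energy ratio for illustration.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # the estimate matches the reference exactly
    return 10.0 * math.log10(signal_energy / error_energy)

# Toy signals (made up): a clean reference, a noisy input, an enhanced output.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.5, -0.5, 1.5, -0.5]
enhanced = [0.9, -0.9, 0.9, -0.9]

# The improvement is the SDR of the enhanced output minus that of the noisy input.
improvement = sdr_db(clean, enhanced) - sdr_db(clean, noisy)
```

A positive `improvement` means the enhanced signal is closer to the clean reference than the noisy input was.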

Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.
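The filtering step described above can be sketched as a time-frequency mask. The example below assumes that the video-to-speech model outputs a magnitude spectrogram aligned frame by frame with the spectrogram of the noisy audio. It shows two hypothetical mask variants: a binary mask obtained by thresholding the prediction, and a soft ratio mask. The abstract does not specify the mask construction or the threshold, so the function names, the threshold value, and the toy data are all assumptions.

```python
# Spectrograms are lists of frames; each frame is a list of frequency bins.
# noisy_spec holds complex STFT values, predicted_mag holds real magnitudes.

def binary_mask(predicted_mag, threshold):
    """1.0 where the predicted speech magnitude exceeds the threshold, else 0.0."""
    return [[1.0 if p > threshold else 0.0 for p in frame] for frame in predicted_mag]

def soft_mask(predicted_mag, noisy_spec, eps=1e-8):
    """Ratio of predicted speech magnitude to noisy magnitude, capped at 1."""
    return [
        [min(1.0, p / (abs(n) + eps)) for p, n in zip(p_frame, n_frame)]
        for p_frame, n_frame in zip(predicted_mag, noisy_spec)
    ]

def apply_mask(noisy_spec, mask):
    """Scale each complex bin of the noisy spectrogram; the noisy phase is kept."""
    return [
        [m * n for m, n in zip(m_frame, n_frame)]
        for m_frame, n_frame in zip(mask, noisy_spec)
    ]

# Toy data (made up): 2 frames x 2 frequency bins.
noisy_spec = [[3 + 4j, 0.5 + 0j], [1 + 0j, 0 + 2j]]
predicted_mag = [[2.5, 0.0], [0.1, 2.0]]

enhanced_binary = apply_mask(noisy_spec, binary_mask(predicted_mag, threshold=1.0))
enhanced_soft = apply_mask(noisy_spec, soft_mask(predicted_mag, noisy_spec))
```

In both variants the mask is derived only from the video-based prediction and the noisy input, which is consistent with the abstract's point that no sound mixtures are needed during training. An inverse STFT, not shown here, would turn the masked spectrogram back into a waveform.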
