SD CV LG AS · Jun 14, 2021

Learning Audio-Visual Dereverberation

Changan Chen, Wei Sun, David Harwath, Kristen Grauman

arXiv:2106.07732v221.039 citations · Has Code

Originality: Highly original

AI Analysis

This paper addresses reverberation in speech processing, which affects applications such as automatic speech recognition and speaker identification. It offers a novel multimodal approach rather than an incremental refinement of audio-only methods.

The paper tackles speech degradation and recognition errors caused by reverberation by introducing an audio-visual dereverberation method that uses visual cues about the room environment to enhance speech. It achieves state-of-the-art performance, substantially improving over audio-only methods in tasks like speech enhancement and recognition.

Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene. In support of this new task, we develop a large-scale dataset SoundSpaces-Speech that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over audio-only methods.
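To make the problem setting concrete, reverberant speech is commonly modeled as the clean signal convolved with a room impulse response (RIR), and dereverberation is the task of undoing that convolution. The sketch below is only an illustration of that standard signal model, not the paper's method or its SoundSpaces-Speech rendering pipeline. The function names `synthetic_rir` and `reverberate`, the exponentially decaying noise RIR, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def synthetic_rir(rt60_s, sample_rate=16000, length_s=0.5, seed=0):
    """Toy RIR (an assumption, not a rendered room): decaying white noise.

    The amplitude envelope drops by 60 dB after rt60_s seconds.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * sample_rate)) / sample_rate
    decay = 10.0 ** (-3.0 * t / rt60_s)  # 60 dB in amplitude is a factor of 1e-3
    rir = rng.standard_normal(t.size) * decay
    rir[0] = 1.0  # direct-path arrival
    return rir / np.max(np.abs(rir))

def reverberate(clean, rir):
    """Convolve a clean waveform with an RIR, truncated to the input length."""
    return np.convolve(clean, rir)[: clean.size]

# Usage: a 0.5 s tone stands in for clean speech.
sample_rate = 16000
t = np.arange(sample_rate // 2) / sample_rate
clean = np.sin(2 * np.pi * 220.0 * t)
rir = synthetic_rir(rt60_s=0.4, sample_rate=sample_rate)
reverberant = reverberate(clean, rir)
```

In a real room the RIR depends on geometry, surface materials, and the speaker's position, which is why the abstract argues that the visual scene carries information useful for removing reverberation.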
