CV MM SD AS
Feb 14, 2022

Visual Acoustic Matching

Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

arXiv:2202.06875v220.169 citations

Originality: Highly original

AI Analysis

This paper addresses realistic audio synthesis for applications such as VR/AR and media production. It defines a novel task rather than making an incremental improvement on an existing one.

The paper tackles the problem of transforming audio to match the acoustics of a target environment based on an image, introducing a cross-modal transformer model with self-supervised training from web videos. The approach successfully translates human speech to various real-world environments, outperforming traditional and supervised baselines.

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
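As a rough illustration of what "audio-visual attention" can mean, the sketch below has audio frames act as queries over image patch features, so each frame gathers visual context. It is a minimal single-head NumPy example and not the authors' model: the function name `cross_modal_attention`, the feature sizes, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention: audio queries, visual keys/values."""
    q = audio_feats @ w_q                      # (T, d) queries from audio frames
    k = visual_feats @ w_k                     # (P, d) keys from image patches
    v = visual_feats @ w_v                     # (P, d) values from image patches
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P) scaled dot products
    weights = softmax(scores, axis=-1)         # each audio frame attends over patches
    fused = audio_feats + weights @ v          # residual connection keeps audio content
    return fused, weights

# Illustrative sizes only: 50 audio frames, 49 image patches, 32-dim features.
rng = np.random.default_rng(0)
T, P, d = 50, 49, 32
audio_feats = rng.standard_normal((T, d))
visual_feats = rng.standard_normal((P, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_modal_attention(audio_feats, visual_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a full system, the fused features would feed a decoder that generates the output waveform. That stage, and the self-supervised training objective, are not sketched here.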
