CV · MM · SD · AS · Feb 14, 2022

Visual Acoustic Matching

arXiv:2202.06875v2 · 69 citations
Originality: Highly original
AI Analysis

The paper addresses realistic audio synthesis for applications such as VR/AR and media production, and it poses a novel task rather than an incremental improvement on an existing one.

The paper tackles transforming audio to match the acoustics of a target environment depicted in an image. It introduces a cross-modal transformer model trained with a self-supervised objective on web videos. The approach translates human speech to a variety of real-world environments and outperforms both traditional acoustic matching and more heavily supervised baselines.

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
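The abstract's idea of using audio-visual attention to inject visual properties into the audio can be illustrated with a minimal cross-attention sketch. This is not the authors' architecture: the function name `cross_modal_attention`, the feature shapes, the random projection matrices, and the residual connection are all assumptions chosen for illustration. Audio frames act as queries, and image patch features act as keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(audio_feats, visual_feats, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention (illustrative only).

    audio_feats:  (T, d) array, one row per audio frame (queries).
    visual_feats: (N, d) array, one row per image patch (keys/values).
    Returns the fused audio features (T, d) and attention weights (T, N).
    """
    q = audio_feats @ w_q                    # (T, d_k)
    k = visual_feats @ w_k                   # (N, d_k)
    v = visual_feats @ w_v                   # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product, (T, N)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    fused = audio_feats + weights @ v        # residual: audio plus visual context
    return fused, weights


# Toy dimensions (assumed): 50 audio frames, 49 image patches, 64-dim features.
rng = np.random.default_rng(0)
T, N, d, d_k = 50, 49, 64, 32
audio_feats = rng.standard_normal((T, d))
visual_feats = rng.standard_normal((N, d))
w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)

fused, weights = cross_modal_attention(audio_feats, visual_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a trained model the projections would be learned and the fused features would feed a waveform generator. Here they are random, so the sketch only shows how each audio frame can pool information from the image.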

Code Implementations: 1 repo
