CV · LG · MM · SD · AS · May 20, 2024

Images that Sound: Composing Images and Sounds on a Single Canvas

arXiv:2405.12221v3 · 17 citations · h-index: 6 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses multimodal content creation for applications in art and media. Its advance is incremental, since it composes existing pre-trained diffusion models rather than introducing a new one.

The paper tackles the problem of synthesizing spectrograms that simultaneously look like natural images and sound like natural audio. It does so with a zero-shot approach: pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space jointly denoise the same latent.

Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/
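The parallel denoising described in the abstract can be illustrated with a toy sketch. The abstract does not say how the two models' outputs are combined, so the weighted average of noise estimates below is an assumption, as are the deterministic DDIM-style update, the linear beta schedule, and every name (`toy_eps_model`, `compose_denoise`, `w_image`). The stand-in noise predictors simply pull the latent toward a fixed target; they replace the real pre-trained text-conditioned models and are not the authors' implementation.

```python
import numpy as np


def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(target):
    """Hypothetical stand-in for a pre-trained noise predictor.

    It assumes the clean latent equals `target` and solves
    z_t = sqrt(a) * x0 + sqrt(1 - a) * eps for eps.
    """
    def eps(z_t, alpha_bar):
        return (z_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return eps


def compose_denoise(eps_image, eps_audio, shape, alphas_cumprod,
                    w_image=0.5, seed=0):
    """Denoise one shared latent with two noise predictors at every step."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        # Assumption: combine the two estimates by a weighted average.
        eps = w_image * eps_image(z, a_t) + (1.0 - w_image) * eps_audio(z, a_t)
        # Deterministic DDIM-style update: predict x0, then step to t - 1.
        x0 = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return z


if __name__ == "__main__":
    image_target = np.ones((4, 8))   # what the toy "image model" prefers
    audio_target = -np.ones((4, 8))  # what the toy "audio model" prefers
    latent = compose_denoise(
        toy_eps_model(image_target),
        toy_eps_model(audio_target),
        shape=(4, 8),
        alphas_cumprod=make_alphas_cumprod(),
    )
    print(latent.round(3))
```

In this toy setting the result lands between the two targets, with `w_image` controlling which model dominates. With real diffusion models the same loop would instead yield a latent that both models consider plausible, which is the property the abstract describes.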

