CV · LG · MM · SD · AS · May 23, 2024

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

arXiv:2405.14598v2 · 6 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the slow progress in audio-visual generation for applications needing cross-modal synthesis, though it is incremental as it adapts existing transformer techniques to this domain.

The paper tackles the problem of audio-visual generation, specifically image-to-audio and audio-to-image tasks, by proposing a simple lightweight generative transformer that operates in discrete VQ-GAN spaces and uses mask denoising training. The method surpasses recent image-to-audio generation methods, achieving excellent results without requiring extra training for classifier-free guidance.
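As a rough illustration of mask denoising training over discrete tokens, the sketch below corrupts a sequence of audio code indices while leaving the image code indices intact, which is one plausible way to build an image-to-audio training example. The names `MASK_ID` and `mask_tokens`, the token values, and the fixed mask ratio are all assumptions made for illustration; the summary does not specify the paper's masking schedule or vocabulary.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete token ids with MASK_ID.

    Returns the corrupted sequence and the masked positions, which are
    the positions a model would be trained to reconstruct.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, positions


rng = random.Random(0)
image_tokens = [12, 7, 93, 41]        # stand-ins for VQ-GAN image code indices
audio_tokens = [5, 5, 18, 64, 2, 30]  # stand-ins for VQ-GAN audio code indices

# Image-to-audio example: keep the image tokens, corrupt the audio tokens.
corrupted_audio, target_positions = mask_tokens(audio_tokens, mask_ratio=0.5, rng=rng)
model_input = image_tokens + corrupted_audio
```

Because the two modalities are treated the same way, swapping which sequence is masked would give the corresponding audio-to-image example, and masking both would correspond to co-generation.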

In recent years, diffusion-based generative models have gained huge attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared with the considerable advances in text2image and text2audio generation, research on audio2visual and visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask denoising manner. After training, classifier-free guidance can be deployed off-the-shelf, achieving better performance without any extra training or modification. Since the transformer model is modality-symmetrical, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/
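To make the classifier-free guidance step concrete, here is a minimal sketch of one common formulation, in which the conditional and unconditional logits for a masked token are combined before sampling. The function names, the logit values, and the guidance scale are hypothetical, and the abstract does not state which exact formulation the paper uses or how the unconditional logits are obtained.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond_logits = [2.0, 1.0, 0.0]    # made-up logits given the conditioning tokens
uncond_logits = [1.0, 1.0, 1.0]  # made-up logits with the condition withheld

plain = softmax(guided_logits(cond_logits, uncond_logits, scale=1.0))
sharp = softmax(guided_logits(cond_logits, uncond_logits, scale=3.0))
```

With `scale=1.0` the result equals the conditional logits; larger scales push probability mass toward tokens that the condition favors, so `sharp` is more peaked than `plain`.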
