SD AI CV MM
Mar 17

Diffusion Models for Joint Audio-Video Generation

arXiv:2603.16093
44.5
AI Analysis

The paper addresses the open challenge of truly joint audio-video generation for multimodal AI applications, though the contribution is incremental because it builds on existing diffusion models.

The paper tackles joint audio-video generation by releasing two paired datasets (13 hours of video-game clips and 64 hours of concert performances) and proposing a sequential two-step text-to-audio-video pipeline that generates video first, then conditions on it to synthesize synchronized audio, yielding high-fidelity results.

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I present four contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on my datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations.
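
The sequential two-step pipeline described in the abstract can be illustrated with a minimal Python sketch. This is only the control flow, not the paper's implementation: the names `generate_audio_video`, `text_to_video`, `video_to_audio`, and `AudioVideoSample` are hypothetical, and the two model stages are passed in as plain callables because the abstract does not say which models are used.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AudioVideoSample:
    """Container for one generated sample (hypothetical structure)."""
    prompt: str
    video: Any
    audio: Any


def generate_audio_video(
    prompt: str,
    text_to_video: Callable[[str], Any],
    video_to_audio: Callable[[Any, str], Any],
) -> AudioVideoSample:
    """Two-step generation: video first, then audio conditioned on it.

    Both callables are placeholders for whatever generative models are
    plugged in; their signatures are assumptions made for this sketch.
    """
    # Step 1: generate the video from the text prompt alone.
    video = text_to_video(prompt)
    # Step 2: synthesize audio conditioned on BOTH the generated video
    # and the original prompt, so the audio can follow the video's timing.
    audio = video_to_audio(video, prompt)
    return AudioVideoSample(prompt=prompt, video=video, audio=audio)
```

Because each stage is an independent callable, either model could be swapped without touching the other, which is the sense in which the abstract calls the approach modular.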
