CV | Dec 2, 2025

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

arXiv:2512.02457v2 | 1 citation | h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving video generation quality through cross-modal training. The advance is incremental, as it builds on existing audio-video generative methods.

The paper investigates whether audio-video joint denoising training improves video generation quality, even when focusing only on video, and finds consistent improvements on challenging motion subsets, suggesting audio acts as a privileged signal for better video dynamics.

Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
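The comparison described in the abstract, a jointly trained T2AV model versus a T2V-only counterpart, can be illustrated with a toy denoising objective. The sketch below is not the AVFullDiT implementation. The noise-prediction loss, the single noise level `alpha` shared by both modalities, and the names `toy_joint_denoiser` and `denoising_loss` are all assumptions made for illustration. It only shows how a joint objective adds an audio term to the video term while the model sees both modalities.

```python
import random

def add_noise(clean, noise, alpha):
    """Blend a clean latent with noise: alpha=1 keeps it clean, alpha=0 is pure noise."""
    return [alpha * c + (1.0 - alpha) * n for c, n in zip(clean, noise)]

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def toy_joint_denoiser(noisy_video, noisy_audio, weight):
    """Hypothetical stand-in for a joint model: every output depends on a
    context computed over all tokens it is given (video, plus audio if any)."""
    tokens = noisy_video + noisy_audio
    context = sum(tokens) / len(tokens)
    pred_video_noise = [weight * (v - context) for v in noisy_video]
    pred_audio_noise = [weight * (a - context) for a in noisy_audio]
    return pred_video_noise, pred_audio_noise

def denoising_loss(video, audio, weight, alpha, rng):
    """Noise-prediction loss. Pass audio=None for the video-only baseline.
    Assumption: both modalities share one noise level alpha."""
    video_noise = [rng.gauss(0.0, 1.0) for _ in video]
    noisy_video = add_noise(video, video_noise, alpha)
    if audio is None:
        # Video-only: the model never sees audio and has no audio loss term.
        pred_v, _ = toy_joint_denoiser(noisy_video, [], weight)
        return mse(pred_v, video_noise)
    audio_noise = [rng.gauss(0.0, 1.0) for _ in audio]
    noisy_audio = add_noise(audio, audio_noise, alpha)
    pred_v, pred_a = toy_joint_denoiser(noisy_video, noisy_audio, weight)
    # Joint: video term plus audio term, both computed from shared context.
    return mse(pred_v, video_noise) + mse(pred_a, audio_noise)

video_latent = [0.2, -0.1, 0.4, 0.0]   # made-up toy values
audio_latent = [0.3, 0.1]              # made-up toy values
joint = denoising_loss(video_latent, audio_latent, 0.5, 0.7, random.Random(0))
video_only = denoising_loss(video_latent, None, 0.5, 0.7, random.Random(0))
print(f"joint loss: {joint:.4f}, video-only loss: {video_only:.4f}")
```

In this toy setup the only differences between the two runs are whether audio tokens enter the shared context and whether an audio loss term is added, which mirrors the "identical settings" comparison at a conceptual level.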

