LG, CV, MM, SD, AS | Dec 8, 2023

CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

arXiv:2312.05412v2 | 10 citations | h-index: 24 | ECCV Workshops
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating synchronized video and audio content. The advance is incremental: it builds on existing diffusion models and adds new architectural and training components.

The paper tackles the problem of bi-directional conditional generation of video and audio by introducing a multi-modal diffusion model with a joint contrastive training loss to improve synchronization. The results show that the model outperforms baselines in quality and generation speed, with improvements in audio-visual alignment, especially for video-to-audio generation.

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. Generation quality and alignment performance are assessed from various angles, using both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through the introduction of our novel cross-modal easy fusion architectural block. Furthermore, incorporating the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
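To make the idea of a contrastive loss between modalities concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss over paired video and audio embeddings. This is an illustration under assumptions, not the paper's actual loss: the abstract does not specify the formulation, and the function name `info_nce`, the `temperature` value, and the use of in-batch negatives are all hypothetical choices made here.

```python
import numpy as np

def info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Hypothetical illustration only: row i of video_emb is assumed to be
    the true partner of row i of audio_emb; all other rows in the batch
    act as negatives. Both inputs have shape (batch, dim).
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (v @ a.T) / temperature  # shape (batch, batch)

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

A loss of this shape is small when each video embedding is closest to its own audio embedding and large when the pairing is scrambled, which is the general mechanism by which contrastive objectives encourage cross-modal alignment. How the paper combines such a term with the diffusion objective is not stated in the abstract.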

