LG CV MM SD AS

Dec 8, 2023

CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Ruihan Yang, Hannes Gamper, Sebastian Braun

arXiv:2312.05412v29.810 citations | h-index: 24 | ECCV Workshops

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating synchronized video and audio content. The advance is incremental: it builds on existing diffusion models and adds new architectural and training components.

The paper tackles the problem of bi-directional conditional generation of video and audio by introducing a multi-modal diffusion model with a joint contrastive training loss to improve synchronization. The results show that the model outperforms baselines in quality and generation speed, with improvements in audio-visual alignment, especially for video-to-audio generation.

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through introduction of our novel cross-modal easy fusion architectural block. Furthermore, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
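
To illustrate the general idea behind a contrastive alignment objective, here is a minimal sketch of a symmetric InfoNCE-style loss over paired video and audio embeddings. This is not the loss defined in the paper, which is formulated jointly with diffusion training; the function name `info_nce`, the `temperature` value, and the assumption that each clip has already been reduced to one embedding vector per modality are all illustrative choices.

```python
import numpy as np


def info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption (illustrative): row i of video_emb and row i of audio_emb
    come from the same clip and form the positive pair; every other row
    in the batch serves as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(scores):
        # Subtract the row-wise max for numerical stability.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )
```

The loss is low when each video embedding is most similar to its own audio embedding and rises when pairs are mismatched, which is the behavior a synchronization-oriented contrastive term is meant to encourage.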
