MM · CV · SD · AS · Sep 11, 2024

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

arXiv:2409.07450v1 · 20 citations · h-index: 28
Originality: Highly original
AI Analysis

This work addresses the challenge of creating realistic and diverse music for videos without relying on limited symbolic annotations, which is important for applications in video editing and multimedia content creation.

The paper generates background music from video inputs by training on large-scale web videos that already carry background music. The resulting model outperforms existing approaches on the DISCO-MV and MusicCaps datasets across several music generation metrics, including human evaluation.

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
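As a rough illustration of what a joint autoregressive and contrastive objective can look like, the sketch below combines a next-token cross-entropy over music tokens with a symmetric InfoNCE-style loss between pooled video and music embeddings. This is not the authors' implementation: the abstract does not specify the contrastive form, so the InfoNCE formulation, the `temperature` value, the `weight` parameter, and all function names here are assumptions chosen for illustration only.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits:  (batch, time, vocab) scores for each music token position.
    targets: (batch, time) integer ids of the ground-truth music tokens.
    """
    logp = log_softmax(logits, axis=-1)
    picked = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return -picked.mean()


def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, music) embeddings.

    Assumption: row i of video_emb is the positive for row i of music_emb,
    and every other row in the batch acts as a negative.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    sim = v @ m.T / temperature          # (batch, batch) cosine similarities
    idx = np.arange(sim.shape[0])
    video_to_music = -log_softmax(sim, axis=1)[idx, idx].mean()
    music_to_video = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(logits, targets, video_emb, music_emb, weight=1.0):
    # Hypothetical weighting: the relative weight of the two terms is assumed.
    return autoregressive_loss(logits, targets) + weight * contrastive_loss(
        video_emb, music_emb
    )
```

In this sketch the autoregressive term rewards predicting the correct music tokens, while the contrastive term pulls each video embedding toward its own soundtrack and away from the other soundtracks in the batch, which is one common way to encourage alignment with high-level content.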

