SD | AI | MM | AS | Jan 17, 2025

GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

arXiv:2501.09972v1 | 16 citations | h-index: 8 | AAAI
Originality: Incremental advance
AI Analysis

This work addresses the challenge of composing music for video and offers a versatile approach to generating multi-style music. Its contribution appears incremental, although it also introduces new evaluation metrics and a dataset.

The paper tackles the problem of automating music generation for video by proposing GVMGen, a model that uses hierarchical attentions to align video and music features, resulting in improved music-video correspondence, generative diversity, and application universality compared to previous models.

Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.

