SD | GR | MM | AS | Jan 11, 2022

Music2Video: Automatic Generation of Music Video with fusion of audio and text

arXiv:2201.03809v2 | 7 citations | Has Code
AI Analysis

This work addresses the need for interactive music video creation tools for users, though it is incremental as it builds on existing multi-modal generative methods.

The paper tackles the problem of generating music videos by fusing audio and text modalities, resulting in a framework that produces videos consistent with both inputs and includes automatic segmentation for time consistency.

Image creation with generative adversarial networks has been widely adapted to the multi-modal regime with the advent of multi-modal representation models pre-trained on large corpora. Modalities that share a common representation space can be used to guide generative models to create images from text or even from an audio source. Departing from previous methods that rely solely on either text or audio, we exploit the expressiveness of both modalities. Based on the fusion of text and audio, we create videos whose content is consistent with each of the provided modalities. Our method includes a simple approach that automatically segments the video into variable-length intervals and maintains temporal consistency in the generated video. Our proposed framework for generating music videos shows promising results at the application level, where users can interactively feed in a music source and a text source to create artistic music videos. Our code is available at https://github.com/joeljang/music2video.
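
The abstract does not spell out how the two modalities are fused or how segment boundaries are chosen, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that text and audio embeddings already live in one shared space, that fusion is a weighted average of unit vectors, and that a segment boundary is placed wherever a per-frame audio energy value jumps sharply. The names `fuse`, `segment`, `text_weight`, `threshold`, and `min_len` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity, usable as a guidance score in a shared space."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def fuse(text_emb, audio_emb, text_weight=0.5):
    """Blend a text and an audio embedding into one guidance target.

    Assumption: a weighted average of unit vectors, renormalized.
    The paper's actual fusion rule may differ.
    """
    if len(text_emb) != len(audio_emb):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    t, a = normalize(text_emb), normalize(audio_emb)
    mixed = [text_weight * ti + (1.0 - text_weight) * ai for ti, ai in zip(t, a)]
    return normalize(mixed)


def segment(energy, threshold, min_len):
    """Split frames into variable-length [start, end) intervals.

    Assumption: a boundary is placed at frame i when the energy rises by
    more than `threshold` over frame i - 1 and the current segment already
    spans at least `min_len` frames.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    if not energy:
        return []
    boundaries = [0]
    for i in range(1, len(energy)):
        jump = energy[i] - energy[i - 1]
        if jump > threshold and i - boundaries[-1] >= min_len:
            boundaries.append(i)
    boundaries.append(len(energy))
    return list(zip(boundaries[:-1], boundaries[1:]))


if __name__ == "__main__":
    target = fuse([1.0, 0.0], [0.0, 1.0], text_weight=0.5)
    print(cosine(target, [1.0, 0.0]), cosine(target, [0.0, 1.0]))
    print(segment([0.1, 0.1, 0.9, 0.2, 0.2, 0.2, 1.0, 0.3], 0.5, 2))
```

With equal weights, the fused target is equally similar to both inputs, and the toy energy curve splits into the intervals (0, 2), (2, 6), and (6, 8). In a setup like this, each interval could receive its own fused target, with neighbouring targets interpolated to keep the video temporally consistent; that last step is also an assumption rather than something the abstract confirms.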

Code Implementations: 1 repo
