SD | AI | MM | AS | Aug 20, 2025

From Sound to Sight: Towards AI-authored Music Videos

arXiv:2509.00029v1 | 1 citation | h-index: 19 | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This work addresses the need for more expressive, automated music video generation. It appears incremental, since it builds on existing off-the-shelf deep learning models rather than introducing new ones.

The authors address the limited expressiveness of conventional music visualisation by proposing two pipelines that automatically generate music videos from any song using off-the-shelf deep learning models. A preliminary user evaluation indicates storytelling potential, visual coherency, and emotional alignment with the music.

Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified, vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we experiment on how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherency and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.
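
The flow described in the abstract (analyse the audio, distil the result into a textual scene description, then generate a clip from that text) can be pictured with a minimal Python sketch. Everything named here is a hypothetical placeholder rather than the authors' implementation: the fixed-length segmentation, the `Segment` fields, the prompt template standing in for the language model, and the `analyse` and `generate_clip` callables standing in for the off-the-shelf audio and video models are all assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One slice of the song with the musical qualities detected in it (hypothetical fields)."""
    start: float            # seconds
    end: float              # seconds
    emotion: str            # e.g. "melancholic"
    instruments: List[str]  # e.g. ["piano", "strings"]


def split_into_segments(duration: float, clip_len: float) -> List[Tuple[float, float]]:
    """Cut [0, duration] into consecutive windows of at most clip_len seconds.

    Fixed-length windows are an assumption made for this sketch only.
    """
    bounds = []
    t = 0.0
    while t < duration:
        bounds.append((t, min(t + clip_len, duration)))
        t += clip_len
    return bounds


def describe_scene(seg: Segment) -> str:
    """Stand-in for the language model step: a fixed template instead of an LLM call."""
    inst = ", ".join(seg.instruments) or "no distinct instruments"
    return f"A {seg.emotion} scene for {seg.start:.0f}s-{seg.end:.0f}s, featuring {inst}."


def make_video_plan(
    duration: float,
    clip_len: float,
    analyse: Callable[[float, float], Tuple[str, List[str]]],
    generate_clip: Callable[[str, float], str],
) -> List[Tuple[str, str]]:
    """Run analysis -> scene text -> clip generation for every segment.

    `analyse` and `generate_clip` are injected so that any audio model or
    video generator could be plugged in; both are hypothetical interfaces.
    """
    plan = []
    for start, end in split_into_segments(duration, clip_len):
        emotion, instruments = analyse(start, end)
        prompt = describe_scene(Segment(start, end, emotion, instruments))
        plan.append((prompt, generate_clip(prompt, end - start)))
    return plan


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs without any model.
    fake_analyse = lambda s, e: ("calm", ["piano"])
    fake_generate = lambda prompt, seconds: f"clip({seconds:.0f}s)"
    for prompt, clip in make_video_plan(12.0, 5.0, fake_analyse, fake_generate):
        print(clip, "<-", prompt)
```

Passing the analysis and generation steps in as callables keeps the sketch independent of any particular model, which matches the abstract's emphasis on off-the-shelf components without claiming which ones were used.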

