SD, LG, AS
Nov 10, 2022

Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation

ByteDance
arXiv:2211.05543v1
6 citations
h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses controllable music generation for artists and musicians by providing an interpretable mapping from the visual domain to the music domain. The advance is incremental: it builds on existing multimodal representation learning, and its novelty lies in the analysis-by-synthesis approach.

The study asks how visual art can be used to control music generation and approaches the question through multimodal representation mapping. It reports an interpretable mapping with equivariant-like properties, so that transformations applied to an image control corresponding transformations of the music, and it releases the Vis2Mus system as a controllable interface.

In this study, we explore the representation mapping from the domain of visual arts to the domain of music, with which we can use visual arts as an effective handle to control music generation. Unlike most studies in multimodal representation learning, which are purely data-driven, we adopt an analysis-by-synthesis approach that combines deep music representation learning with user studies. Such an approach enables us to discover an interpretable representation mapping without a huge amount of paired data. In particular, we discover that the visual-to-music mapping has a useful property similar to equivariance. In other words, we can use various image transformations, such as changing brightness, changing contrast, or style transfer, to control the corresponding transformations in the music domain. In addition, we released the Vis2Mus system as a controllable interface for symbolic music generation.
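The equivariance-like property described in the abstract can be illustrated with a toy sketch. This is not the Vis2Mus implementation. The mapping from mean brightness to a pitch centre, the constants `BASE_PITCH` and `PITCH_RANGE`, and every function name below are hypothetical, chosen only to show what "transforming the image, then mapping" agreeing with "mapping, then transforming the music" means.

```python
import math

# Hypothetical constants for illustration only (not taken from the paper).
BASE_PITCH = 48.0    # pitch centre (MIDI number) assigned to a fully dark image
PITCH_RANGE = 24.0   # semitones spanned between the darkest and brightest image


def mean_brightness(image):
    """Average intensity of a flat list of pixel values in [0, 1]."""
    return sum(image) / len(image)


def visual_to_music(image):
    """Toy mapping f: image -> pitch centre (an assumed, linear mapping)."""
    return BASE_PITCH + PITCH_RANGE * mean_brightness(image)


def brighten(image, delta):
    """Image-domain transform: add delta to every pixel (no clipping)."""
    return [p + delta for p in image]


def transpose(pitch_centre, semitones):
    """Music-domain transform: shift the pitch centre by some semitones."""
    return pitch_centre + semitones


def is_equivariant(image, delta, tol=1e-9):
    """Check f(brighten(x)) == transpose(f(x)) for the matching shift."""
    transformed_then_mapped = visual_to_music(brighten(image, delta))
    mapped_then_transformed = transpose(visual_to_music(image), PITCH_RANGE * delta)
    return math.isclose(transformed_then_mapped, mapped_then_transformed, abs_tol=tol)


image = [0.1, 0.4, 0.5, 0.8]
print(is_equivariant(image, 0.1))
```

The check holds exactly here only because the toy mapping is linear and `brighten` does not clip pixel values. A learned mapping such as the one the abstract describes would satisfy it at best approximately, which is consistent with the abstract's wording "similar to" equivariance.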

Code Implementations: 1 repo
