SDAICLMMASJun 18, 2025

SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

arXiv:2506.15154v12 citationsh-index: 17AIMC
Originality Incremental advance
AI Analysis

This work addresses the need for enriched music databases and advances in music AI research, but it is incremental as it builds on existing multi-task learning and dataset extension methods.

The paper tackles the problem of generating detailed captions for music by introducing SonicVerse, a multi-task model that integrates caption generation with auxiliary music feature detection, and reports that this approach improves caption quality and detail.

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details as well as high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large-language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, captions and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes