Rethinking Music Captioning with Music Metadata LLMs
This work addresses the scarcity of high-quality music caption data for music understanding and controllable generation. The contribution is incremental, since it builds on existing metadata and LLM techniques.
The paper tackles the problem of generating natural language descriptions of music by proposing a metadata-based captioning method: a model predicts metadata from audio, and a pre-trained LLM converts that metadata into captions. The method performs comparably to an end-to-end baseline with less training time, and it offers flexibility in stylization and metadata imputation.
Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data, which is scarce compared to metadata (e.g., genre and mood). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable performance in less training time than end-to-end captioners, (2) offers flexibility to easily change stylization post-training, enabling output captions to be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling, a common task for organizing music data.
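The two-stage idea described above (predict metadata, then let an LLM phrase it) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the field names, the prompt wording, and the helper names `impute_metadata` and `build_caption_prompt` are hypothetical and do not come from the paper. In particular, the paper conditions the prediction model on audio and partial metadata, whereas this sketch simply overlays known values on independently predicted ones as a simpler stand-in. The LLM call itself is omitted; the sketch only shows how the caption style can be chosen at inference time without retraining.

```python
from typing import Dict, List, Optional

# Hypothetical metadata schema: field name -> list of values.
Metadata = Dict[str, List[str]]


def impute_metadata(
    known: Dict[str, Optional[List[str]]], predicted: Metadata
) -> Metadata:
    """Keep known values; fill missing (None or empty) fields from predictions.

    Simplified stand-in: the paper prompts the model with partial metadata,
    while this function only merges two dictionaries after the fact.
    """
    merged: Metadata = {}
    for field in sorted(set(known) | set(predicted)):
        values = known.get(field)
        merged[field] = list(values) if values else list(predicted.get(field, []))
    return merged


def build_caption_prompt(metadata: Metadata, style: str) -> str:
    """Render metadata as an LLM prompt; `style` is picked at inference time."""
    lines = [
        f"- {field}: {', '.join(values)}"
        for field, values in metadata.items()
        if values
    ]
    header = f"Write a {style} caption for a piece of music with this metadata."
    return header + "\n" + "\n".join(lines)


# Assumed output of a metadata prediction model (made-up values).
predicted = {
    "genre": ["jazz"],
    "mood": ["relaxed"],
    "instruments": ["piano", "double bass"],
}
# Partial metadata supplied by a user; "mood" is unknown.
known = {"genre": ["bossa nova"], "mood": None}

merged = impute_metadata(known, predicted)
prompt = build_caption_prompt(merged, style="one-sentence, casual")
print(prompt)
```

Because the style lives only in the prompt, changing `style` yields a differently worded request over the same factual metadata. This is the separation of facts from phrasing that the abstract contrasts with end-to-end captioners.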