MuLan: A Joint Embedding of Music Audio and Natural Language
This work addresses the need for more flexible music tagging and retrieval systems for users and researchers, moving beyond rigid ontologies to handle diverse genres and text styles.
The paper tackles the problem of linking music audio directly to unconstrained natural language descriptions. It presents MuLan, a two-tower joint audio-text embedding model trained on 44 million music recordings (370K hours) paired with weakly associated, free-form text annotations. The resulting representation subsumes existing ontologies and supports zero-shot music tagging and cross-modal retrieval.
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
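To make the zero-shot tagging idea concrete, the following is a minimal sketch of how a shared audio-text embedding space can be queried. It is not MuLan's implementation or API: the function names (`l2_normalize`, `cosine_similarity`, `zero_shot_tag`) are hypothetical, the use of cosine similarity as the scoring function is an assumption, and the 3-dimensional vectors are hand-made stand-ins for the outputs of the audio and text towers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_tag(audio_embedding, tag_embeddings):
    """Rank candidate tags by similarity to one audio embedding.

    audio_embedding: vector assumed to come from an audio tower.
    tag_embeddings: dict mapping tag text to a vector assumed to come
        from a text tower that shares the same embedding space.
    Returns a list of (tag, score) pairs, best match first.
    """
    scores = {
        tag: cosine_similarity(audio_embedding, emb)
        for tag, emb in tag_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy vectors for illustration only; real embeddings would be produced
# by trained audio and text encoders.
toy_audio = [0.9, 0.1, 0.0]
toy_tags = {
    "rock": [1.0, 0.0, 0.0],
    "jazz": [0.0, 1.0, 0.0],
    "ambient": [0.0, 0.0, 1.0],
}

ranking = zero_shot_tag(toy_audio, toy_tags)
print(ranking[0][0])  # prints "rock"
```

Because the tag vocabulary is just a dictionary of text embeddings, new tags or free-form queries can be added at query time without retraining, which is the sense in which such a setup is zero-shot. Swapping the roles (one text query scored against many audio embeddings) gives the corresponding text-to-music retrieval sketch.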