SD LG AS Jun 23, 2023

DISCO-10M: A Large-Scale Music Dataset

arXiv:2306.13512v2 24 citations h-index: 25
Originality: Synthesis-oriented
AI Analysis

This dataset addresses accessibility and resource gaps for researchers in machine learning for music, though the contribution is incremental because it builds on existing data collection methods.

The authors tackle the shortage of large music datasets by introducing DISCO-10M, a large-scale dataset roughly ten times larger than the largest previously available one, with precomputed CLAP embeddings to support downstream tasks.

Music datasets play a crucial role in advancing research in machine learning for music. However, existing music datasets suffer from limited size, accessibility, and lack of audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process. This process incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application on various downstream tasks. These embeddings enable efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research to help advance the development of novel machine learning models for music.
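The abstract describes the filtering only at a high level: similarities based on textual descriptions and on audio embeddings. As a rough illustration of what such a two-stage similarity filter might look like, here is a minimal sketch. It assumes cosine similarity and fixed thresholds, neither of which the abstract confirms. The names `cosine_similarity` and `filter_candidates`, the dictionary keys, and the threshold values are all hypothetical, and the 2-D vectors are toy stand-ins for real CLAP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(candidates, text_threshold=0.5, audio_threshold=0.5):
    """Keep candidates that pass both a text and an audio similarity check.

    Assumption: each candidate carries an embedding of its own text and audio
    plus reference embeddings to compare against. The thresholds are
    illustrative, not values from the paper.
    """
    kept = []
    for c in candidates:
        # Stage 1: textual description must resemble the reference text.
        if cosine_similarity(c["text_emb"], c["ref_text_emb"]) < text_threshold:
            continue
        # Stage 2: audio embedding must resemble the reference audio.
        if cosine_similarity(c["audio_emb"], c["ref_audio_emb"]) < audio_threshold:
            continue
        kept.append(c["id"])
    return kept


# Toy data: "a" matches on both stages, "b" fails on text, "c" fails on audio.
candidates = [
    {"id": "a", "text_emb": [1.0, 0.0], "ref_text_emb": [1.0, 0.0],
     "audio_emb": [0.0, 1.0], "ref_audio_emb": [0.0, 1.0]},
    {"id": "b", "text_emb": [1.0, 0.0], "ref_text_emb": [0.0, 1.0],
     "audio_emb": [0.0, 1.0], "ref_audio_emb": [0.0, 1.0]},
    {"id": "c", "text_emb": [1.0, 0.0], "ref_text_emb": [1.0, 0.0],
     "audio_emb": [1.0, 0.0], "ref_audio_emb": [0.0, 1.0]},
]

print(filter_candidates(candidates))  # ['a']
```

Running the cheaper text check first means the audio comparison is only done for candidates that survive stage one. Whether the actual pipeline orders its stages this way is not stated in the abstract.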

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
