LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models
LyCon enables academic research in lyric studies by providing a copyright-free dataset for experiments such as conditional lyric generation, though the contribution is incremental because it adapts existing methods to a specific domain.
The paper tackles the challenge of copyright restrictions on lyrics by reconstructing copyright-free lyrics from Bag-of-Words datasets using metadata and large language models, resulting in the LyCon dataset aligned with sources like the Million Song Dataset.
This paper addresses a challenge specific to lyric studies: direct use of lyrics is often restricted by copyright. Unlike most other kinds of data, internet-sourced lyrics are frequently protected under copyright law, which necessitates alternative approaches. Our study introduces a novel method for generating copyright-free lyrics from publicly available Bag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the lyrics themselves. Using the metadata associated with the BoW datasets together with large language models, we reconstruct lyrics from this vocabulary. We compile and publicly release LyCon, a dataset of reconstructed lyrics aligned with metadata from renowned sources including the Million Song Dataset, Deezer Mood Detection Dataset, and AllMusic Genre Dataset. We believe that the integration of metadata such as mood annotations or genres enables a variety of academic experiments on lyrics, such as conditional lyric generation.
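To make the setup concrete, the following is a minimal sketch of what a BoW record retains and how it might be combined with metadata into an instruction for a language model. It is not the authors' pipeline: the prompt wording, the function names (`bag_of_words`, `build_prompt`, `vocabulary_coverage`), and the example title, genre, and mood are all hypothetical, and the call to an actual language model is left out. The coverage check at the end is likewise only an assumed sanity measure, not a metric reported in the paper.

```python
from collections import Counter


def bag_of_words(lyrics: str) -> Counter:
    """Reduce lyrics to unordered word counts; word order is discarded."""
    return Counter(lyrics.lower().split())


def build_prompt(bow: Counter, title: str, genre: str, mood: str) -> str:
    """Assemble a hypothetical reconstruction prompt from a BoW and metadata."""
    # List each word with its count, most frequent first.
    vocab = ", ".join(f"{word} (x{count})" for word, count in bow.most_common())
    return (
        f"Write original song lyrics titled '{title}'.\n"
        f"Genre: {genre}. Mood: {mood}.\n"
        f"Use these words roughly this often: {vocab}."
    )


def vocabulary_coverage(bow: Counter, generated: str) -> float:
    """Fraction of distinct BoW words that appear in the generated text."""
    if not bow:
        return 0.0
    produced = set(generated.lower().split())
    return sum(1 for word in bow if word in produced) / len(bow)


# Illustrative input only; a real BoW dataset ships the counts, not the text.
bow = bag_of_words("love me love me do")
prompt = build_prompt(bow, title="Example Song", genre="pop", mood="happy")
print(prompt)
```

The point of the sketch is that the BoW alone fixes which words occur and how often, while the metadata supplies the context (title, genre, mood) that a model needs to arrange those words into plausible lyrics.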