S2Cap: A Benchmark and a Baseline for Singing Style Captioning
This work addresses a gap in resources for singing voice analysis and enables downstream tasks such as style captioning, although it is incremental in that it builds on existing audio-text dataset concepts.
The authors address the lack of detailed audio-text datasets for singing voices by introducing S2Cap, a benchmark dataset with comprehensive descriptions of vocal, acoustic, and demographic characteristics, and by developing a baseline algorithm for singing style captioning.
Singing voices contain much richer information than common voices, including varied vocal and acoustic properties. However, current open-source audio-text datasets for singing voices capture only a narrow range of attributes and lack acoustic features, which limits their utility for downstream tasks such as style captioning. To fill this gap, we formally define the singing style captioning task and present S2Cap, a dataset of singing voices with detailed descriptions covering diverse vocal, acoustic, and demographic characteristics. Using this dataset, we develop an efficient and straightforward baseline algorithm for singing style captioning. The dataset is available at https://zenodo.org/records/15673764.