Sound Effects Dataset Unification With the Universal Category System
This work addresses the problem of incompatible taxonomies in sound effects datasets, enabling data merging and comparable research outcomes for the audio machine learning community.
The authors propose a modular framework to unify sound effects datasets under the Universal Category System (UCS) and use it to create the EnvSound-UCS dataset, with 58,057 clips from three sources. The framework achieves high automatic conversion rates via a rule-based multi-stage pipeline with conflict resolution.
Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation: incompatible taxonomies lead to siloed datasets that may require dataset-specific approaches, yield non-comparable results, and prevent data merging. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert the tags of existing datasets to UCS using a rule-based multi-stage pipeline with conflict resolution that achieves high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase its practical utility, we introduce the EnvSound-UCS dataset, a publicly available, unified, UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.
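
To make the idea of a rule-based multi-stage conversion with conflict resolution more concrete, the following is a minimal sketch in Python. It is not the authors' implementation. The rule tables, the category strings (for example "ANIMALS/DOG"), the function names, and the majority-vote tie handling are all hypothetical choices made for illustration; the category strings have not been checked against the official UCS list.

```python
from collections import Counter

# Hypothetical rule tables. The category strings are illustrative
# placeholders and are NOT verified against the official UCS list.
EXACT_RULES = {
    "dog": "ANIMALS/DOG",
    "bark": "ANIMALS/DOG",
    "rain": "RAIN/GENERAL",
    "thunderstorm": "WEATHER/THUNDER",
}
KEYWORD_RULES = [
    ("dog", "ANIMALS/DOG"),
    ("rain", "RAIN/GENERAL"),
]


def convert_tag(tag):
    """Map one source tag to a category, or return None if no rule applies.

    Stage 1 tries an exact lookup; stage 2 falls back to substring keywords.
    """
    key = tag.strip().lower()
    if key in EXACT_RULES:
        return EXACT_RULES[key]
    for keyword, category in KEYWORD_RULES:
        if keyword in key:
            return category
    return None


def convert_clip(tags):
    """Convert all tags of one clip and resolve disagreements by majority vote.

    Returns (category, status), where status is 'converted', 'conflict'
    (a tie that would need manual review), or 'unmapped'.
    """
    votes = Counter(c for c in map(convert_tag, tags) if c is not None)
    if not votes:
        return None, "unmapped"
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, "conflict"
    return ranked[0][0], "converted"


def conversion_rate(clips):
    """Fraction of clips that received a category automatically."""
    if not clips:
        return 0.0
    converted = sum(1 for tags in clips if convert_clip(tags)[1] == "converted")
    return converted / len(clips)
```

In this sketch, a clip tagged with "Dog", "bark", and "rain" would be assigned the dog placeholder category by majority vote, while a clip tagged only with "dog" and "rain" would be flagged as a conflict rather than silently assigned. Ties are surfaced instead of broken arbitrarily so that ambiguous clips can be reviewed or handled by a later stage.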