Benchmarking Representations for Speech, Music, and Acoustic Events
This work addresses researchers' need for systematic comparison in audio representation learning, though it is incremental in that it builds on existing benchmarking practices.
The authors address the limited diversity of audio representation learning benchmarks by introducing ARCH, a comprehensive benchmark that covers speech, music, and acoustic events across 12 datasets. ARCH enables a thorough assessment of pre-trained SSL models of different sizes, and the newly released pre-trained models show strong performance on non-speech datasets.
Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods across diverse audio classification domains: acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained self-supervised learning (SSL) models of different sizes. ARCH streamlines the benchmarking of ARL techniques by providing unified access to a wide range of domains and by readily incorporating new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that this wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods and helps pinpoint promising research directions.