Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
This work addresses the bottleneck of weak and noisy labels in audio pre-training, a problem relevant to researchers and practitioners in audio understanding. Its contribution is incremental in that it builds on the pre-training blueprint already established in vision.
The paper tackles the fragmentation of audio pre-training by establishing a large-scale, strong supervision framework. The resulting pipeline uses a high-fidelity captioner and a Unified Tag System to create SOTA-quality captions, and the experiments suggest that data quality and coverage are the key drivers of performance.
Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision's foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.