SD CL AS

Jun 23, 2025

USAD: Universal Speech and Audio Representation via Distillation

Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu

MIT

arXiv:2506.18843v2

15.69 citations

h-index: 11

Originality: Incremental advance

AI Analysis

USAD addresses the need for a single audio representation model that serves diverse tasks, offering a practical solution for audio processing applications.

The paper tackles the problem of domain-specific audio representations by proposing USAD, a unified model that integrates speech, sound, and music and achieves near state-of-the-art results on the SUPERB and HEAR benchmarks.

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
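The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what layer-to-layer distillation from several teachers can look like, not the authors' implementation. It assumes a generic L1 plus cosine-distance loss, a one-to-one pairing of student and teacher layers, frame-aligned features, and one linear projection per (teacher, layer) pair. All function names, dimensions, and the random features are hypothetical.

```python
import numpy as np


def layer_distill_loss(student_feats, teacher_feats, proj):
    """L1 plus cosine distance between projected student frames and teacher frames.

    student_feats: (T, D_student), teacher_feats: (T, D_teacher),
    proj: (D_student, D_teacher) linear head (assumed, not from the paper).
    """
    pred = student_feats @ proj  # map student frames into the teacher's space
    l1 = np.abs(pred - teacher_feats).mean()
    cos = (pred * teacher_feats).sum(-1) / (
        np.linalg.norm(pred, axis=-1) * np.linalg.norm(teacher_feats, axis=-1) + 1e-8
    )
    return l1 + (1.0 - cos.mean())


def multi_teacher_loss(student_layers, teachers, projections):
    """Average the per-layer loss over every (teacher, layer) pair."""
    losses = [
        layer_distill_loss(s, t, w)
        for teacher_layers, teacher_proj in zip(teachers, projections)
        for s, t, w in zip(student_layers, teacher_layers, teacher_proj)
    ]
    return float(np.mean(losses))


# Toy setup: random arrays stand in for hidden states of a student and two
# domain-specific teachers (e.g. one speech model, one general-audio model).
rng = np.random.default_rng(0)
T, D_S, D_T, N_LAYERS = 50, 16, 24, 3
student_layers = [rng.standard_normal((T, D_S)) for _ in range(N_LAYERS)]
teachers = [
    [rng.standard_normal((T, D_T)) for _ in range(N_LAYERS)] for _ in range(2)
]
projections = [
    [rng.standard_normal((D_S, D_T)) * 0.1 for _ in range(N_LAYERS)] for _ in range(2)
]

loss = multi_teacher_loss(student_layers, teachers, projections)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the projections and the student would be learned by gradient descent while the teachers stay frozen; this sketch only evaluates the loss once.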
