SD AI

Feb 18, 2025

Keep what you need: extracting efficient subnetworks from large audio representation models

David Genova, Philippe Esling, Tom Hurlin

arXiv:2502.12925v17.01 citations

h-index: 27

Has Code

ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the deployment inefficiency of large audio models for real-time and device-limited applications, offering an incremental improvement in model compression.

The paper addresses the fact that large, complex audio foundation models are unsuitable for deployment on consumer devices and in real-time applications. It does so by introducing a method that extracts lightweight, task-specific subnetworks. The authors report a significant reduction in computational cost while maintaining downstream-task performance, demonstrated on three audio foundation models across speech, music, and general audio.

Research on audio foundation models has recently seen notable advances, as illustrated by ever-improving results on complex downstream tasks, and these pretrained networks have quickly been adopted for a variety of audio applications. These improvements have, however, come with a considerable increase in both the size and the complexity of the models. Beyond the environmental concerns this raises, it prevents the deployment of such networks on consumer-level devices and precludes their use in real-time applications. It also sits at odds with the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple yet effective method for extracting lightweight specialist subnetworks from large foundation models. Specifically, we introduce learnable binary masks between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective, thereby learning a compact subnetwork specialized for a single task. Importantly, the weights of the foundation model are kept frozen, resulting in low additional training costs. Once training is complete, the masked computational units can be removed from the network, yielding significant performance gains. We assess our method on three widely used audio foundation models, each based on a different backbone architecture, and illustrate its effectiveness on common audio representation evaluation tasks, as well as its versatility across speech, music, and general audio. Code for reproducing the results and a supporting webpage are available at https://github.com/gnvIRCAM/Audio-representation-trimming
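The masking and trimming steps described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the masks are parameterised, which units they act on, or what form the sparsity loss takes. The sketch assumes a toy two-layer network with frozen weights, a binary mask over its hidden units, and a penalty equal to the fraction of units kept. All function names (`masked_forward`, `regularized_loss`, `trim`) are hypothetical, and the mask is supplied by hand rather than learned.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # one row per output unit
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix by an input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def masked_forward(w1: Matrix, w2: Matrix, mask: List[int], x: Vector) -> Vector:
    """Frozen two-layer network with a binary mask applied between the layers."""
    hidden = relu(matvec(w1, x))
    hidden = [m * h for m, h in zip(mask, hidden)]  # masked units output 0
    return matvec(w2, hidden)


def regularized_loss(task_loss: float, mask: List[int], lam: float) -> float:
    """Task loss plus a sparsity penalty (assumed form: fraction of units kept)."""
    return task_loss + lam * sum(mask) / len(mask)


def trim(w1: Matrix, w2: Matrix, mask: List[int]) -> Tuple[Matrix, Matrix]:
    """Physically remove masked hidden units from both adjacent layers."""
    keep = [i for i, m in enumerate(mask) if m == 1]
    w1_small = [w1[i] for i in keep]                   # rows that produce kept units
    w2_small = [[row[i] for i in keep] for row in w2]  # columns that read kept units
    return w1_small, w2_small


def forward(w1: Matrix, w2: Matrix, x: Vector) -> Vector:
    """Plain forward pass, used for the trimmed network."""
    return matvec(w2, relu(matvec(w1, x)))


# Toy example: 2 inputs, 3 hidden units, 1 output; the second hidden unit is masked.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[2.0, -1.0, 0.5]]
mask = [1, 0, 1]
x = [1.0, 2.0]

w1_small, w2_small = trim(w1, w2, mask)
# The trimmed network reproduces the masked network with fewer weights (6 instead of 9).
assert masked_forward(w1, w2, mask, x) == forward(w1_small, w2_small, x)
```

Because a masked unit contributes exactly zero to the next layer, deleting its row in the first weight matrix and its column in the second leaves the output unchanged while reducing the number of multiply-adds. In an actual training setup the mask entries would be learned jointly with the downstream head while the backbone weights stay fixed.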

