Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification
This work addresses how researchers and practitioners can fine-tune pre-trained audio models effectively, and it offers a broadly applicable strategy for doing so. The contribution is incremental, however, because it builds on existing contrastive learning and fine-tuning methods.
Standard fine-tuning couples representation learning with classifier training, which obscures the quality of the learned representations. The paper proposes a disentangled two-stage framework that separates representation refinement from downstream evaluation. The framework improves accuracy on audio classification tasks: it outperforms vanilla fine-tuning, particularly on single-label datasets with many classes, and surpasses strong baselines on multi-label tasks.
Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a "contrastive-tuning" stage to explicitly improve the geometric structure of the model's embedding space. Second, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved accuracy. The dual-probe protocol also acts as an analytical lens: it helps explain why contrastive learning is more effective by revealing a better-structured embedding space. Our approach significantly outperforms vanilla fine-tuning, particularly on single-label datasets with a large number of classes, and also surpasses strong baselines on multi-label tasks using a Jaccard-weighted loss. Our findings demonstrate that decoupling representation refinement from classifier training is a broadly effective strategy for unlocking the full potential of pre-trained audio models. Our code will be publicly available.
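To make the dual-probe idea concrete, the following is a minimal sketch of how frozen embeddings might be scored with a linear probe and a k-Nearest Neighbours probe. It is not the authors' implementation. The function names `linear_probe` and `knn_probe`, the choice of cosine similarity, `k=5`, the softmax-regression training loop, and the synthetic Gaussian "embeddings" are all assumptions made for illustration; in practice the inputs would be embeddings extracted from the contrastive-tuned audio encoder.

```python
import numpy as np

def knn_probe(train_x, train_y, test_x, test_y, k=5):
    """Local-structure probe: majority vote over the k most cosine-similar
    training embeddings (k and the similarity measure are assumptions)."""
    a = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    b = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = b @ a.T                                # (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]     # indices of the k nearest
    nn_labels = train_y[nn_idx]                   # (n_test, k)
    n_classes = int(train_y.max()) + 1
    votes = np.stack([np.bincount(row, minlength=n_classes) for row in nn_labels])
    pred = votes.argmax(axis=1)
    return float((pred == test_y).mean())

def linear_probe(train_x, train_y, test_x, test_y, epochs=200, lr=0.5):
    """Global-separability probe: softmax regression fitted by full-batch
    gradient descent on frozen embeddings (the encoder is never updated)."""
    n_classes = int(train_y.max()) + 1
    mu, sd = train_x.mean(axis=0), train_x.std(axis=0) + 1e-8
    xtr, xte = (train_x - mu) / sd, (test_x - mu) / sd
    w = np.zeros((xtr.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[train_y]
    for _ in range(epochs):
        logits = xtr @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(xtr)                # d(cross-entropy)/d(logits)
        w -= lr * xtr.T @ grad
        b -= lr * grad.sum(axis=0)
    pred = (xte @ w + b).argmax(axis=1)
    return float((pred == test_y).mean())

# Synthetic stand-in for frozen embeddings: three well-separated Gaussian
# clusters in 8 dimensions (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
centers = 6.0 * np.eye(3, 8)

def sample(n):
    y = rng.integers(0, 3, size=n)
    x = centers[y] + rng.normal(size=(n, 8))
    return x, y

train_x, train_y = sample(300)
test_x, test_y = sample(100)

lin_acc = linear_probe(train_x, train_y, test_x, test_y)
knn_acc = knn_probe(train_x, train_y, test_x, test_y)
print(f"linear probe accuracy: {lin_acc:.3f}")
print(f"k-NN probe accuracy:   {knn_acc:.3f}")
```

Reading the two scores together is the point of the protocol: a high linear-probe score with a low k-NN score would suggest classes that are globally separable but locally intermixed, while the reverse would suggest tight local clusters that are not linearly separable.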