CL SD AS Sep 25, 2025

MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee

arXiv:2509.20706v1 4.91 citations h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses domain adaptation for speech emotion recognition in real-world deployments where source data cannot be shared, offering a practical solution for emotion-aware speech systems.

The paper tackles domain mismatch in speech emotion recognition when source data are unavailable and a large audio-language model is accessible only through an API. It proposes MI-Fuse, which adapts a student model through label fusion and outperforms the strongest baseline by 3.9% across three datasets and six cross-domain transfers.

Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
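The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the mutual-information estimate (entropy of the mean prediction minus the mean per-sample entropy), the exp(-MI) weighting, the decay value, and all function names below are assumptions chosen for illustration. Each teacher is assumed to return several sampled class distributions per utterance; one-hot votes parsed from an API response would also work.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(samples):
    """Average K sampled distributions class by class."""
    k = len(samples)
    return [sum(col) / k for col in zip(*samples)]


def mutual_information(samples):
    """Assumed uncertainty estimate: H(mean prediction) - mean(H(prediction)).

    It is zero when all samples agree and grows as they disagree.
    """
    expected_h = sum(entropy(p) for p in samples) / len(samples)
    return max(0.0, entropy(mean_distribution(samples)) - expected_h)


def fuse_labels(lalm_samples, clf_samples):
    """Fuse the two teachers' mean distributions into one soft label.

    Assumption: each teacher is weighted by exp(-MI), so the teacher whose
    samples are more consistent contributes more to the fused label.
    """
    teachers = (lalm_samples, clf_samples)
    means = [mean_distribution(s) for s in teachers]
    scores = [math.exp(-mutual_information(s)) for s in teachers]
    total = sum(scores)
    weights = [s / total for s in scores]
    n_classes = len(means[0])
    fused = [sum(w * m[c] for w, m in zip(weights, means))
             for c in range(n_classes)]
    return fused, weights


def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average of student parameters (flat lists here)."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


# Hypothetical samples over four emotion classes for one utterance.
lalm_samples = [[0.9, 0.1, 0.0, 0.0],
                [0.1, 0.9, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0]]   # the LALM disagrees with itself
clf_samples = [[0.7, 0.2, 0.1, 0.0],
               [0.7, 0.2, 0.1, 0.0],
               [0.7, 0.2, 0.1, 0.0]]    # the classifier is consistent

fused, weights = fuse_labels(lalm_samples, clf_samples)
```

In this toy case the classifier's samples agree exactly, so its mutual information is zero and it receives the larger weight. The fused distribution would then serve as the soft target for the student, with `ema_update` applied to a copy of the student's parameters after each training step.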
