ASLGMar 24

Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

arXiv:2603.2305735.7h-index: 3
AI Analysis

This work addresses speech emotion recognition for applications in human-computer interaction, but it is incremental as it builds on existing models with a fusion approach.

The paper tackled improving speech emotion recognition by combining zero-shot audio-language models with specialist foundation models, achieving state-of-the-art performance with reported improvements over baselines like WavLM-Large on three datasets.

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes