SD CL AS · Feb 25, 2025

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

arXiv:2502.18186v1 · 16.710 citations · h-index: 13 · Has Code · IEEE Transactions on Audio, Speech, and Language Processing

Originality: Incremental advance

AI Analysis

This work addresses unreliable emotion recognition in audio language models (ALMs) for applications such as human-computer interaction. It appears incremental because it builds on existing ALM and speech emotion recognition (SER) methods.

The paper tackles hallucinations and misclassifications in speech emotion recognition (SER) with large-scale audio language models (ALMs). It proposes C$^2$SER, which integrates contextual perception and chain of thought to improve stability and accuracy. In the reported experiments it outperforms existing ALMs such as Qwen2-Audio and SECap.

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
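
The self-distillation from explicit CoT to implicit CoT mentioned in the abstract can be illustrated with a generic knowledge-distillation loss. The sketch below is an assumption for illustration only, not the authors' released code: it treats the explicit-CoT pass as a teacher and the implicit-CoT pass as a student, and measures their disagreement with a temperature-scaled KL divergence. The function names, the four-emotion label set, the temperature, and the logit values are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic distillation loss (assumed form, not taken from the paper).

    teacher_logits: emotion logits from the explicit-CoT pass.
    student_logits: emotion logits from the implicit-CoT pass.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)

# Hypothetical label set and logits, for illustration only.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
explicit_cot_logits = [0.2, 2.5, -1.0, 0.1]
implicit_cot_logits = [0.5, 1.0, 0.0, 0.3]

loss = self_distillation_loss(explicit_cot_logits, implicit_cot_logits)
print(f"distillation loss: {loss:.4f}")
```

Minimizing a loss of this kind pulls the direct (implicit) prediction toward the step-by-step (explicit) one, so that inference may skip the intermediate reasoning text; whether C$^2$SER uses this exact objective is not stated in the abstract.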
