AI | Sep 29, 2025

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao

arXiv:2509.25458v1 | 5.81 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses accurate emotion recognition in speech for applications such as human-computer interaction. It is an incremental advance because it builds on existing prompting methods.

The paper targets two weaknesses of large audio-language models in zero-shot speech emotion recognition: weak paralinguistic modeling and limited cross-modal reasoning. It proposes Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), which uses structured Emotion Graphs to guide inference without fine-tuning. The authors report that it outperforms the prior state of the art and improves accuracy over zero-shot baselines.

Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
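To make the idea of embedding an Emotion Graph into a prompt concrete, here is a minimal sketch. It is not the authors' implementation: the `EmotionGraph` class, its field names, the JSON serialization, the prompt wording, and every value are assumptions for illustration only. The abstract names four of the seven acoustic features, so only those four appear here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EmotionGraph:
    """Hypothetical container for the fields the abstract attributes to an EG."""
    acoustic: dict      # acoustic feature -> coarse level (levels are assumed)
    sentiment: str      # textual sentiment of the transcript
    keywords: list      # salient words from the transcript
    # Cross-modal associations as [acoustic cue, textual cue, relation] triples.
    associations: list = field(default_factory=list)


def build_prompt(graph, labels):
    """Serialize the graph and place it ahead of the classification question."""
    graph_json = json.dumps(asdict(graph), indent=2)
    return (
        "Emotion Graph:\n" + graph_json + "\n\n"
        "Using the graph above and the audio, reason step by step, "
        "then answer with one label from: " + ", ".join(labels) + "."
    )


# Illustrative values only; none of these come from the paper.
example = EmotionGraph(
    acoustic={
        "pitch": "high",
        "speech_rate": "fast",
        "jitter": "elevated",
        "shimmer": "elevated",
    },
    sentiment="negative",
    keywords=["late", "again"],
    associations=[["pitch:high", "keyword:again", "reinforces"]],
)

prompt = build_prompt(example, ["angry", "happy", "neutral", "sad"])
print(prompt)
```

The resulting string would be passed to an audio-language model together with the audio clip. How the features are extracted and how the model is called are outside the scope of this sketch.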
