CLAS Oct 29, 2025

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

arXiv:2510.25054v2, 4 citations, h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the generalization and modality integration issues in SLMs for emotion recognition, providing insights for researchers and developers in speech processing, though it is incremental in nature.

The study evaluated four spoken language models (SLMs) on speech emotion recognition using emotionally incongruent speech samples, where text and speech express different emotions, and found that SLMs rely more on textual semantics than acoustic cues for emotion recognition.

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
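
The incongruent-speech evaluation described above can be illustrated with a minimal sketch. This is not the authors' released code and does not assume the EMIS file format: the function name `modality_agreement`, the tuple layout, and the toy labels are all hypothetical. It only shows how, when the text and the voice carry different emotions, each prediction can be attributed to the text emotion, the speech emotion, or neither.

```python
from collections import Counter


def modality_agreement(samples):
    """Return the fraction of predictions matching each modality.

    `samples` is an iterable of (text_emotion, speech_emotion, predicted)
    tuples. This layout is a hypothetical one chosen for illustration.
    Samples are assumed to be incongruent (text_emotion != speech_emotion).
    """
    counts = Counter()
    for text_emotion, speech_emotion, predicted in samples:
        if predicted == text_emotion:
            counts["text"] += 1      # model followed the semantic content
        elif predicted == speech_emotion:
            counts["speech"] += 1    # model followed the vocal expressiveness
        else:
            counts["neither"] += 1   # prediction matches neither modality
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("text", "speech", "neither")}


# Toy, made-up labels (not results from the paper).
toy_samples = [
    ("happy", "sad", "happy"),
    ("angry", "happy", "angry"),
    ("sad", "angry", "angry"),
    ("happy", "angry", "neutral"),
]
rates = modality_agreement(toy_samples)
print(rates)
```

Under this framing, a model that relies mostly on textual semantics would show a "text" rate well above its "speech" rate, which is the pattern the abstract reports.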
