A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
This work addresses emotional intelligence in human-like spoken dialogue systems. It is an incremental improvement that applies a novel method to a known bottleneck.
The paper tackles emotional intelligence in spoken language models by introducing Injected Emotional-Attribution Thinking (IEAT), which incorporates user emotional states and their causes into the model's internal reasoning. The approach achieves top-ranked performance on the HumDial benchmark for emotional trajectory modeling, emotional reasoning, and empathetic response generation.
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
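As a rough illustration of the IEAT data construction idea described above, the following sketch builds one training target in which the user's emotional state and its cause are placed in a reasoning segment ahead of the response. The abstract does not specify the actual data format, so the `<think>` delimiters, the field names, and the `build_ieat_sample` helper are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One annotated dialogue turn (field names are assumptions)."""
    user_text: str   # transcript of the user's utterance
    emotion: str     # annotated user emotional state, e.g. "anxious"
    cause: str       # annotated cause of that emotion
    response: str    # empathetic reference response


def build_ieat_sample(turn: Turn) -> dict:
    """Inject emotional-attribution thinking into the training target.

    The emotion and its cause are written into a reasoning segment that
    precedes the response, so the model learns to produce the attribution
    as part of its own reasoning rather than as a separate label.
    The <think> tags are a hypothetical delimiter, not the paper's format.
    """
    thinking = f"The user seems {turn.emotion} because {turn.cause}."
    target = f"<think>{thinking}</think>{turn.response}"
    return {"input": turn.user_text, "target": target}


example = Turn(
    user_text="I have my interview tomorrow and I can't sleep.",
    emotion="anxious",
    cause="they have an important interview tomorrow",
    response="That sounds stressful. Would it help to talk through it?",
)
sample = build_ieat_sample(example)
print(sample["target"])
```

In this sketch the attribution is part of the sequence the model is trained to generate, which is one plausible reading of "internalized rather than treated as explicit supervision"; a separate emotion-classification head would be the contrasting design.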