Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
This work addresses the problem of building emotionally intelligent speech LLMs for applications that require nuanced understanding. The contribution is incremental: it builds on existing methods and adds specific enhancements.
The paper tackles the challenge of leveraging paralinguistic cues such as prosody and emotion in speech LLMs, where progress is limited by scarce training data and by models that rely on lexical shortcuts. It proposes a multi-task reinforcement learning approach with a two-stage pipeline and reports 8-12% improvements in paralinguistic understanding over supervised baselines and proprietary models on Expresso, IEMOCAP, and RAVDESS.
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
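To make the idea of jointly optimizing two tasks concrete, the following is a minimal sketch of how a single scalar reward could combine a sentiment-classification signal with a response-quality signal. It is an illustration under stated assumptions, not the reward used in the paper: the abstract does not specify the reward, so the `Rollout` fields, the `multitask_reward` function, the equal weights, and the judge-style `response_score` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled model output for a single audio clip (hypothetical structure)."""
    predicted_sentiment: str  # label parsed from the model's reasoning output
    gold_sentiment: str       # reference label for the clip
    response_score: float     # assumed external quality score for the reply, in [0, 1]


def multitask_reward(rollout: Rollout, w_cls: float = 0.5, w_gen: float = 0.5) -> float:
    """Weighted sum of a classification reward and a generation reward.

    The weights are illustrative assumptions, not values from the paper.
    """
    # Classification task: binary reward for matching the reference label.
    cls_reward = 1.0 if rollout.predicted_sentiment == rollout.gold_sentiment else 0.0
    # Generation task: clamp the assumed quality score into [0, 1].
    gen_reward = min(max(rollout.response_score, 0.0), 1.0)
    return w_cls * cls_reward + w_gen * gen_reward


if __name__ == "__main__":
    correct = Rollout("sad", "sad", response_score=0.8)
    wrong = Rollout("happy", "sad", response_score=0.8)
    print(multitask_reward(correct))
    print(multitask_reward(wrong))
```

In this sketch a rollout that reads the emotion correctly always scores higher than one that does not, given the same response quality, which is one simple way to discourage a policy from ignoring the audio-derived sentiment.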