Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models
This work addresses the challenge of empathetic reasoning in speech-language models, with applications such as human-computer interaction. The contribution is incremental, since it builds on existing methods.
The paper tackles limited empathetic reasoning in large speech-language models by incorporating contextual paralinguistic information through an explicit method and an implicit method. The implicit method improves LLM-judged performance by 38.41%, and combining it with the explicit method raises the improvement to 46.02%.
Current large speech-language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.
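To make the two approaches concrete, the following is a minimal sketch of how the explicit and implicit methods might assemble their text inputs. It is not the authors' implementation. The `Utterance` schema, the function names, the prompt wording, and the choice of valence, arousal, and dominance as the dimensional annotations are all assumptions made for illustration; the abstract only states that categorical and dimensional emotion annotations are used alongside transcriptions.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One annotated speech segment (hypothetical schema, not from the paper)."""
    transcript: str
    emotion: str      # categorical label, e.g. "angry"
    valence: float    # dimensional annotations; the [0, 1] range is an assumption
    arousal: float
    dominance: float


def describe(u: Utterance) -> str:
    """Render the paralinguistic annotations as a short metadata string."""
    return (
        f"emotion={u.emotion}; valence={u.valence:.2f}; "
        f"arousal={u.arousal:.2f}; dominance={u.dominance:.2f}"
    )


def explicit_prompt(u: Utterance, question: str) -> str:
    """Explicit method (sketch): hand the metadata to the LLM with the question."""
    return f"[Paralinguistic metadata: {describe(u)}]\nQuestion: {question}"


def qa_generation_prompt(u: Utterance, n_pairs: int = 3) -> str:
    """Implicit method (sketch): build a request for a text LLM to write
    training QA pairs grounded in both the transcript and the annotations."""
    return (
        f"Transcript: {u.transcript}\n"
        f"Speaker state: {describe(u)}\n"
        f"Write {n_pairs} question-answer pairs that can only be answered "
        "by combining what was said with how it was said."
    )
```

In this sketch the explicit path exposes the annotations to the model directly, while the implicit path uses them only to generate training QA pairs, so the annotations themselves need not appear in the model's input.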