CL · AI · Sep 20, 2025

Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

arXiv:2509.16589v2 · 4 citations · h-index: 14 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This paper addresses the limited social and emotional intelligence of speech-LLMs, a problem relevant to researchers and developers. Its contribution is incremental, since it offers a benchmark rather than a new method.

The authors tackled the limitation of speech-LLMs in understanding paralinguistic aspects like emotion and prosody by proposing CP-Bench, a benchmark for evaluating contextual paralinguistic reasoning, which revealed a key gap in existing evaluations and provided insights for building more context-aware models.

Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art open- and closed-source speech-LLMs and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.

