SD · Mar 10

EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller

arXiv:2603.09820v1 · 11.0 · h-index: 26

Predicted impact: top 22% in SD · last 90 days

Originality: Highly original

AI Analysis

This work addresses a critical bottleneck in speech captioning evaluation that affects both researchers and developers, though the contribution is incremental because it builds on known evaluation challenges.

The paper tackles the problem of evaluating detailed, long-context emotional speech captions. It proposes EmoSURA, a framework that shifts from holistic scoring to atomic verification. EmoSURA correlates positively with human judgments, whereas traditional metrics correlated negatively.

Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
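The decompose-then-verify idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: `decompose` is a naive punctuation splitter standing in for whatever produces Atomic Perceptual Units, `toy_verifier` is a keyword check standing in for the audio-grounded verifier, the `audio` dictionary stands in for the raw speech signal, and aggregating by the fraction of supported units is one plausible choice that the abstract does not specify.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitResult:
    unit: str        # one self-contained statement from the caption
    supported: bool  # whether the verifier judged it consistent with the audio


def decompose(caption: str) -> List[str]:
    """Hypothetical stand-in for decomposition: split on sentence punctuation."""
    parts = re.split(r"[.;!?]+", caption)
    return [p.strip() for p in parts if p.strip()]


def verify_caption(
    caption: str,
    audio: object,
    verifier: Callable[[str, object], bool],
) -> List[UnitResult]:
    """Check every unit independently against the audio."""
    return [UnitResult(u, verifier(u, audio)) for u in decompose(caption)]


def support_rate(results: List[UnitResult]) -> float:
    """Assumed aggregation: fraction of units the verifier supports."""
    if not results:
        return 0.0
    return sum(r.supported for r in results) / len(results)


# Toy stand-in for the speech signal: a dict of annotated attributes.
audio = {"pitch": "high", "emotion": "angry"}


def toy_verifier(unit: str, audio: object) -> bool:
    """Hypothetical verifier: a unit is supported if it mentions a known attribute value."""
    return any(value in unit.lower() for value in audio.values())


caption = "The speaker sounds angry. Her pitch is high; the pace is slow."
results = verify_caption(caption, audio, toy_verifier)
print(f"{support_rate(results):.2f}")  # prints 0.67 (2 of 3 units supported)
```

Because each unit is checked on its own, a longer caption does not change how any single unit is judged. The caption only adds or removes terms from the average.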
