CL SD
Apr 29

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu

arXiv:2604.26417

AI Analysis

This work addresses a limitation of existing speech emotion captioning, which is restricted to static, single-emotion descriptions of isolated sentences, by modeling dynamic emotion transitions at the discourse level. The intended application is emotionally intelligent conversational agents.

The paper introduces EmoTransCap, described by the authors as the first large-scale dataset for discourse-level emotion transition-aware speech captioning, built with an automated pipeline. It proposes a multi-task model (MTETR) for joint emotion transition detection and diarization, and separately introduces a controllable, transition-aware emotional speech synthesis system. The dataset includes descriptive and instruction-oriented annotations, facilitating fine-grained temporal-dynamic emotion understanding.

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
