CL · SD · Apr 29

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

arXiv:2604.26417 · 51.0
AI Analysis

This work addresses the limitation of static single-emotion speech captioning by enabling dynamic emotion transition modeling at the discourse level, benefiting emotionally intelligent conversational agents.

The paper introduces EmoTransCap, which it describes as the first large-scale dataset for discourse-level emotion transition-aware speech captioning, built with an automated pipeline. It proposes a multi-task model (MTETR) for joint emotion transition detection and diarization, and separately introduces a controllable, transition-aware emotional speech synthesis system. The dataset includes descriptive and instruction-oriented annotations, which support fine-grained, temporally dynamic emotion understanding.

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
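To make "joint emotion transition detection and diarization" concrete, here is a minimal sketch of what such output could look like and how it might be turned into a descriptive caption. It is an illustration only, not the paper's MTETR model or annotation pipeline: `EmotionSegment`, `find_transitions`, and `describe` are hypothetical names, and the caption template stands in for the LLM-generated annotations the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionSegment:
    """Hypothetical diarization unit: one time span with one emotion label."""
    start: float  # seconds from the start of the discourse
    end: float    # seconds from the start of the discourse
    emotion: str


def find_transitions(segments):
    """Return (time, from_emotion, to_emotion) for each label change
    between time-adjacent segments."""
    ordered = sorted(segments, key=lambda s: s.start)
    transitions = []
    for prev, curr in zip(ordered, ordered[1:]):
        if prev.emotion != curr.emotion:
            transitions.append((curr.start, prev.emotion, curr.emotion))
    return transitions


def describe(segments):
    """Render a simple template caption (assumed format, not the paper's)."""
    ordered = sorted(segments, key=lambda s: s.start)
    if not ordered:
        return "No emotion segments."
    parts = [f"The speaker starts out {ordered[0].emotion}"]
    for time, _, to_emotion in find_transitions(ordered):
        parts.append(f"shifts to {to_emotion} at {time:.1f}s")
    return ", then ".join(parts) + "."


segments = [
    EmotionSegment(0.0, 3.2, "neutral"),
    EmotionSegment(3.2, 6.0, "neutral"),
    EmotionSegment(6.0, 9.5, "angry"),
    EmotionSegment(9.5, 12.0, "sad"),
]
print(describe(segments))
```

Adjacent segments with the same label produce no transition, so the two "neutral" spans above count as one emotional state in the caption.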

