Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
Spoken DialogSum addresses a data bottleneck in spoken dialogue summarization and enables emotion-aware modeling. The contribution is incremental, however, since it builds on existing datasets and methods.
To address the lack of data linking speech, summaries, and paralinguistic cues, the authors introduce Spoken DialogSum, a corpus aligning raw conversational audio with factual and emotion-rich summaries. In their baseline experiments, an Audio-LLM improves emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system.
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. We release an online demo at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/, with plans to release the full dataset in the near future. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
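To make the reported metric concrete, here is a minimal sketch of how a ROUGE-L score and a relative improvement could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not specify tokenization, stemming, or the F-measure weighting, so this sketch assumes whitespace tokenization, lowercasing, and a plain F1. The function names and the example scores are hypothetical. A 28% relative gain means the new score is 1.28 times the baseline score, not 28 points higher.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L as a plain F1 over LCS precision and recall.

    Assumption: lowercased whitespace tokens, no stemming; the paper's
    exact ROUGE configuration is not stated in the abstract.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, new):
    """Relative change of `new` over `baseline`, e.g. 0.25 means +25%."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Toy strings, not taken from the dataset.
    reference = "the customer sounds frustrated about the late delivery"
    candidate = "the customer is frustrated about a late delivery"
    print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")

    # Hypothetical placeholder scores, not results from the paper.
    print(f"Relative gain: {relative_improvement(0.20, 0.25):.0%}")
```

For published comparisons, an established ROUGE implementation is preferable to a hand-rolled one, since tokenization and stemming choices change the scores.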