CL | SD | AS | Oct 24, 2024

AlignCap: Aligning Speech Emotion Captioning to Human Preferences

arXiv:2410.19134v1 | 29 citations | h-index: 2 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately describing complex speech emotions in natural language for applications in human-computer interaction, though it appears incremental as it builds on existing SEC methods with LLM-based improvements.

The paper targets hallucinations and poor generalization in Speech Emotion Captioning (SEC). It proposes AlignCap, an LLM-based method that combines speech-text alignment with human preference alignment, and reports stronger zero-shot SEC performance than state-of-the-art methods.

Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns Speech Emotion Captioning to human preferences based on a large language model (LLM) with two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization; 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD-Regularization. Experiments demonstrate that AlignCap achieves stronger performance than other state-of-the-art methods on the zero-shot SEC task.
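
The two regularizers named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The abstract does not give the exact loss forms, so the code assumes (a) a per-token KL divergence from the text-input distribution to the speech-input distribution for the KD term, and (b) a DPO-style logistic loss over a preferred and a rejected caption for the PO term. All function names, the KL direction, and the `beta` parameter are hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kd_regularizer(text_logits, speech_logits):
    """Assumed KD term: mean per-token KL(text || speech).

    Each argument is a list of per-token logit vectors produced by the
    same LLM, once for the text input and once for the speech input.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_token) / len(per_token)


def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Assumed PO term: DPO-style loss, -log(sigmoid(margin)).

    Inputs are sequence log-probabilities of the preferred and rejected
    captions under the trained model and under a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen)
        - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


# Toy usage: two response tokens over a three-word vocabulary.
text_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
speech_logits = [[1.0, 1.0, -0.5], [0.0, 1.0, 0.8]]
kd = kd_regularizer(text_logits, speech_logits)
po = preference_loss(-4.0, -6.0, -5.0, -5.5)
```

Under these assumptions the KD term is zero when the speech and text inputs induce identical response distributions and grows as they diverge, while the PO term shrinks as the model assigns relatively more probability to the preferred caption than the reference model does.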

