CL · LG · AS · Nov 15, 2023

Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech

arXiv:2311.08607v1 · 4 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses emotion detection for human-machine interaction, though its hybrid method appears to be an incremental advance.

The study tackled cross-corpus bias in speech emotion recognition by combining 16 multilingual datasets and proposing a soft labeling system to capture emotional intensities, achieving notable zero-shot generalization on four multilingual datasets.

Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methodologies often display biases when applied cross-corpus. To address this, our study amalgamates 16 diverse datasets, resulting in 375 hours of data across languages like English, Chinese, and Japanese. We propose a soft labeling system to capture gradational emotional intensities. Using the Whisper encoder and data augmentation methods inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on four multilingual datasets demonstrates notable zero-shot generalization. We publish our open source model weights and initial promising results after fine-tuning on Hume-Prosody.
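The abstract does not spell out how the soft labels are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each utterance comes with non-negative per-emotion intensity scores, normalizes them into a probability distribution, and scores model logits against that distribution with cross-entropy. The label set `EMOTIONS` and the helper names `soft_label` and `soft_cross_entropy` are hypothetical.

```python
import math

# Hypothetical label set; the paper's actual emotion classes are not given here.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(intensities):
    """Normalize non-negative per-emotion intensities into a distribution."""
    total = sum(intensities.get(e, 0.0) for e in EMOTIONS)
    if total <= 0:
        raise ValueError("at least one emotion needs a positive intensity")
    return [intensities.get(e, 0.0) / total for e in EMOTIONS]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [x / s for x in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and model logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# An utterance rated mostly happy with some neutral content.
target = soft_label({"happy": 3, "neutral": 1})
print(target)
print(soft_cross_entropy([0.0, 2.0, 1.0, 0.0], target))
```

Unlike a one-hot label, the soft target keeps the secondary emotion in the loss, so a prediction that puts some mass on "neutral" is penalized less than one that puts it on "angry".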

Code Implementations: 1 repo