CL · AI · Mar 25, 2025

Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

arXiv:2503.21806v1 · 4 citations · h-index: 13 · ICME
Originality: Incremental advance
AI Analysis

The paper addresses cross-language emotion recognition for human-computer interaction applications. The advance is incremental because it builds on existing contrastive learning and LLM methods.

The paper tackles zero-shot multilingual speech emotion recognition by combining contrastive learning with large language models, and reports effective performance on unseen datasets and languages.

Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
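The abstract does not specify the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss (CLIP-style) that pulls each speech embedding toward its paired text embedding and pushes it away from the other items in the batch. The function names, the temperature of 0.07, and the toy random embeddings are all assumptions made for illustration, not details from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_info_nce(speech_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    speech_emb[i] and text_emb[i] are assumed to describe the same utterance;
    every other pairing in the batch is treated as a negative.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # Cosine similarity between every speech/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: each row should peak on its diagonal entry.
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: same idea applied to the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)


# Toy data: random "text" embeddings and two candidate sets of "speech" embeddings.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[v + 0.01 * random.gauss(0, 1) for v in row] for row in text]
mismatched = text[1:] + text[:1]  # every speech item paired with the wrong text

loss_aligned = symmetric_info_nce(aligned, text)
loss_mismatched = symmetric_info_nce(mismatched, text)
```

In this toy setup the loss is lower when the paired embeddings agree (`loss_aligned`) than when the pairs are shifted (`loss_mismatched`), which is the behavior an alignment stage would exploit. A real system would compute the embeddings with trained speech and text encoders and backpropagate through this loss in a deep learning framework.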

Code Implementations: 1 repo
