SD | AS | Jun 4

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

arXiv:2606.06200 | 29.7
AI Analysis

For speech emotion recognition researchers, this method addresses the challenge of cross-lingual generalization without target language annotations, though it is an incremental improvement over existing techniques.

The paper tackles zero-shot cross-lingual speech emotion recognition by proposing an emotion-discriminative representation learning method combining supervised contrastive learning and speaker adversarial learning, achieving significant improvements over conventional training strategies.

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. Supervised contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
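To make the contrastive half of the abstract concrete, here is a minimal sketch of a supervised contrastive loss in its commonly used form: utterances sharing an emotion label are pulled together and all others are pushed apart. The abstract does not give the paper's exact loss, so the function name `supcon_loss`, the `temperature` value, and the toy embeddings are illustrative assumptions, not the authors' implementation. The speaker adversarial branch is not shown.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    # Hypothetical helper: supervised contrastive loss over one batch.
    # Positives for an anchor are all other samples with the same label.
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy 2-D embeddings (assumed): two tight clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = ["angry", "angry", "sad", "sad"]   # labels match the clusters
mixed = ["angry", "sad", "angry", "sad"]     # labels cut across the clusters

loss_aligned = supcon_loss(embeddings, aligned)
loss_mixed = supcon_loss(embeddings, mixed)
print(loss_aligned < loss_mixed)
```

In this toy setup the loss is lower when same-emotion samples already sit close together. That is the property the abstract relies on for cross-lingual emotion alignment, assuming a batch mixes utterances from several source languages.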
