SD AS Jun 4

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

arXiv:2606.06200

AI Analysis

For speech emotion recognition researchers, this method addresses the challenge of cross-lingual generalization without target language annotations, though it is an incremental improvement over existing techniques.

The paper tackles zero-shot cross-lingual speech emotion recognition by proposing an emotion-discriminative representation learning method combining supervised contrastive learning and speaker adversarial learning, achieving significant improvements over conventional training strategies.

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. Supervised contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
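
The abstract does not give the loss formulation, so the following is only a minimal sketch of a supervised contrastive loss in its commonly used form, assuming utterance embeddings are L2-normalized and compared by temperature-scaled cosine similarity. The function names, the `temperature` value, and the toy embeddings are hypothetical and not taken from the paper; the speaker adversarial branch is not shown. Utterances with the same emotion label are pulled together regardless of language, which is the alignment effect the abstract describes.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have a positive.

    Assumption: this is the widely used formulation, not necessarily the
    exact loss in the paper. Positives are other samples sharing the
    anchor's emotion label; all other samples act as negatives.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings (hypothetical): two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered_labels = ["angry", "angry", "sad", "sad"]   # labels match clusters
scattered_labels = ["angry", "sad", "angry", "sad"]   # labels cut across clusters

loss_clustered = supervised_contrastive_loss(toy_embeddings, clustered_labels)
loss_scattered = supervised_contrastive_loss(toy_embeddings, scattered_labels)
```

In this toy case the loss is lower when same-emotion embeddings already sit close together, so minimizing it during training pushes the encoder toward emotion-clustered representations.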
