CL | LG | Sep 19, 2025

Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

arXiv:2509.20373v1 | 1 citation | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses cross-lingual emotion recognition for speech processing applications. It is an incremental, domain-specific advance.

The paper tackles cross-lingual speech emotion recognition by proposing a speaker-style aware phoneme anchoring framework to align emotional expression across languages and speakers, resulting in improved generalization on English and Taiwanese Mandarin datasets over competitive baselines.

Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
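The abstract names two steps, graph-based clustering of speakers and anchoring toward shared reference points, without giving details. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes speaker embeddings are already available as vectors, builds a cosine-similarity graph with a hypothetical `threshold`, and uses connected components as a simple stand-in for whatever community-detection method the paper actually uses. The function names and the squared-distance loss are likewise illustrative assumptions.

```python
import numpy as np

def speaker_communities(embeddings, threshold=0.8):
    """Label speakers by connected components of a cosine-similarity graph.

    Assumption: connected components stand in for the paper's
    (unspecified here) graph-based clustering method.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sim = x @ x.T                                     # pairwise cosine similarity
    n = len(x)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect two speakers when their similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    # Map component roots to consecutive labels 0, 1, 2, ...
    relabel = {}
    labels = [relabel.setdefault(find(i), len(relabel)) for i in range(n)]
    return np.array(labels)

def community_anchors(embeddings, labels):
    """Use the mean embedding of each community as its anchor (an assumption)."""
    x = np.asarray(embeddings, dtype=float)
    return np.stack([x[labels == c].mean(axis=0) for c in range(labels.max() + 1)])

def anchoring_loss(embeddings, labels, anchors):
    """Mean squared distance between each embedding and its community anchor."""
    x = np.asarray(embeddings, dtype=float)
    return float(np.mean(np.sum((x - anchors[labels]) ** 2, axis=1)))

# Toy speaker embeddings for one emotion class: two visibly separate groups.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
labels = speaker_communities(toy, threshold=0.8)
anchors = community_anchors(toy, labels)
loss = anchoring_loss(toy, labels, anchors)
```

Under these assumptions, "emotion-specific" communities would mean running `speaker_communities` once per emotion class, and "dual-space" anchoring would mean applying a loss of this kind to both speaker-level and phoneme-level embeddings. How the paper combines the two spaces is not stated in the abstract.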

