LG · May 29, 2025

EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast

arXiv:2505.23732v1 · 4 citations · h-index: 25 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the challenge of modeling fine-grained emotion variations in speech and text for applications like affective computing, but it is incremental as it builds on existing CLAP frameworks.

The paper tackles the problem of capturing ordinal relationships in emotions for contrastive language-audio pretraining (CLAP), which existing methods fail to do, and introduces EmotionRankCLAP, a supervised approach that outperforms prior methods in cross-modal retrieval tasks.

Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
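The ranking idea in the abstract can be made concrete with a small sketch of a generic Rank-N-Contrast loss. This is an illustration under stated assumptions, not the paper's implementation: the function name `rank_n_contrast_loss`, the use of negative Euclidean distance as the embedding similarity, Euclidean distance between (valence, arousal) labels, and the temperature value are all choices made here for illustration. The sketch also treats a single pool of embeddings, whereas the paper applies the objective across audio and text. For each anchor i and each other sample j, the loss asks that j be closer to i in embedding space than every sample whose label is at least as far from i as j's label is.

```python
import numpy as np

def rank_n_contrast_loss(embeddings, labels, temperature=1.0):
    """Generic Rank-N-Contrast loss over one batch (illustrative sketch).

    embeddings: (N, D) array of feature vectors.
    labels:     (N, 2) array of (valence, arousal) values.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    n = z.shape[0]

    # Assumption: similarity is negative Euclidean distance, scaled by temperature.
    sim = -np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1) / temperature
    # Assumption: label distance is Euclidean distance in valence-arousal space.
    label_dist = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)

    not_self = ~np.eye(n, dtype=bool)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Comparison set for the pair (i, j): every sample other than i whose
            # label is at least as far from i as j's label (this includes j itself).
            mask = not_self[i] & (label_dist[i] >= label_dist[i, j])
            logits = sim[i, mask]
            # Numerically stable log-sum-exp over the comparison set.
            m = logits.max()
            log_denominator = m + np.log(np.exp(logits - m).sum())
            total += log_denominator - sim[i, j]
    return total / (n * (n - 1))

# Toy batch: four samples with increasing valence and arousal.
toy_labels = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
# Embeddings whose geometry follows the label ranking.
ordered = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
# The same embeddings assigned to the wrong samples.
shuffled = ordered[[0, 3, 1, 2]]

ordered_loss = rank_n_contrast_loss(ordered, toy_labels)
shuffled_loss = rank_n_contrast_loss(shuffled, toy_labels)
print(ordered_loss, shuffled_loss)
```

In this toy batch the loss is lower when embedding distances follow the label ranking than when the same embeddings are assigned to the wrong samples. That is the ordinal structure a plain one-to-one audio-text alignment objective does not enforce.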

