SDAIAS | May 23, 2025

ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge

DeepMind
arXiv:2505.18217v1 | 1 citation | h-index: 30 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses speech emotion recognition in real-world conditions. The advance is incremental: it builds on existing methods and applies them to a specific domain.

The authors tackled speech emotion recognition in naturalistic settings by developing the Abhinaya system, which integrates speech-based, text-based, and speech-text models with tailored loss functions and majority voting. Once all of its models were fully trained, the system achieved state-of-the-art performance among published results.

Speech emotion recognition (SER) in naturalistic settings remains a challenge due to intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge, which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, leverages large language models (LLMs) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
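
The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote` and `ensemble_decisions`, the example labels, and the tie-breaking rule (an optional priority list, then the first label seen) are all assumptions made for illustration, since the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[str], priority: Sequence[str] = ()) -> str:
    """Return the most frequent label among the model predictions.

    Ties are broken by `priority` (hypothetical rule), then by first-seen order.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    # Counter keeps insertion order, so `tied` is in first-seen order.
    tied = [label for label in counts if counts[label] == top]
    for label in priority:
        if label in tied:
            return label
    return tied[0]


def ensemble_decisions(per_model: Sequence[Sequence[str]]) -> List[str]:
    """per_model[m][i] is model m's label for utterance i."""
    return [majority_vote(votes) for votes in zip(*per_model)]


# Hypothetical outputs of three models (speech, text, speech-text) on three utterances.
speech = ["happy", "sad", "angry"]
text = ["happy", "neutral", "angry"]
speech_text = ["sad", "neutral", "happy"]
print(ensemble_decisions([speech, text, speech_text]))
# prints ['happy', 'neutral', 'angry']
```

Using an odd number of models reduces how often ties occur, but with more than two classes ties remain possible, which is why the sketch makes the tie-breaking rule explicit.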

Code Implementations: 1 repo