Carlos Busso

AS
h-index32
34papers
682citations
Novelty47%
AI Score55

34 Papers

ASOct 31, 2025Code
NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen et al.

Everyday speech conveys far more than words, it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. Dataset is available at: https://huggingface.co/JHU-SmileLab

SDJul 6, 2024
A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition

Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee

Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications. While recent SER research relies heavily on large pretrained models for emotion training, existing studies often concentrate solely on the final transformer layer of these models. However, given the task-specific nature and hierarchical architecture of these models, each transformer layer encapsulates different levels of information. Leveraging this hierarchical structure, our study focuses on the information embedded across different layers. Through an examination of layer feature similarity across different languages, we propose a novel strategy called a layer-anchoring mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our approach is evaluated using two distinct language affective corpora (MSP-Podcast and BIIC-Podcast), achieving a best UAR performance of 60.21% on the BIIC-podcast corpus. The analysis uncovers interesting insights into the behavior of popular pretrained models.

CVMar 15, 2022
Driving Anomaly Detection Using Conditional Generative Adversarial Network

Yuning Qiu, Teruhisa Misu, Carlos Busso

Anomaly driving detection is an important problem in advanced driver assistance systems (ADAS). It is important to identify potential hazard scenarios as early as possible to avoid potential accidents. This study proposes an unsupervised method to quantify driving anomalies using a conditional generative adversarial network (GAN). The approach predicts upcoming driving scenarios by conditioning the models on the previously observed signals. The system uses the difference of the output from the discriminator between the predicted and actual signals as a metric to quantify the anomaly degree of a driving segment. We take a driver-centric approach, considering physiological signals from the driver and controller area network-Bus (CAN-Bus) signals from the vehicle. The approach is implemented with convolutional neural networks (CNNs) to extract discriminative feature representations, and with long short-term memory (LSTM) cells to capture temporal information. The study is implemented and evaluated with the driving anomaly dataset (DAD), which includes 250 hours of naturalistic recordings manually annotated with driving events. The experimental results reveal that recordings annotated with events that are likely to be anomalous, such as avoiding on-road pedestrians and traffic rule violations, have higher anomaly scores than recordings without any event annotation. The results are validated with perceptual evaluations, where annotators are asked to assess the risk and familiarity of the videos detected with high anomaly scores. The results indicate that the driving segments with higher anomaly scores are more risky and less regularly seen on the road than other driving segments, validating the proposed unsupervised approach.

CLJan 9
On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin et al. · mit

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using ``global token perplexity'', which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.

SDJul 25, 2024
Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment

Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela et al.

Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiment indicates that the text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise-robustness of the SER system. With a contrastive learning (CL)-based representation, our proposed method can be improved by jointly fine-tuning the text encoder with the emotion recognition model. Under the -5dB signal-to-noise ratio (SNR) level, fine-tuning the text encoder improves our CL-based representation method by 76.4% (arousal), 100.0% (dominance), and 27.7% (valence).

ASJul 5, 2024
Rethinking Speaker Embeddings for Speech Generation: Sub-Center Modeling for Capturing Intra-Speaker Diversity

Ismail Rasim Ulgen, John H. L. Hansen, Carlos Busso et al.

Modeling the rich prosodic variations inherent in human speech is essential for generating natural-sounding speech. While speaker embeddings are commonly used as conditioning inputs in personalized speech generation, they are typically optimized for speaker recognition, which encourages the loss of intra-speaker variation. This strategy makes them suboptimal for speech generation in terms of modeling the rich variations at the output speech distribution. In this work, we propose a novel speaker embedding network that employs multiple sub-centers per speaker class during training, instead of a single center as in conventional approaches. This sub-center modeling allows the embedding to capture a broader range of speaker-specific variations while maintaining speaker classification performance. We demonstrate the effectiveness of the proposed embeddings on a voice conversion task, showing improved naturalness and prosodic expressiveness in the synthesized speech.

ASJan 23
Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration

Esther Sun, Abinay Reddy Naini, Carlos Busso

Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that through multi-layer fusion and acoustic feature integration, discrete tokens can close the performance gap with continuous representations in SER tasks.

ASMar 23
MSP-Conversation: A Corpus for Naturalistic, Time-Continuous Emotion Recognition

Luz Martinez-Lucas, Pravin Mote, Abinay Reddy Naini et al.

Affective computing aims to understand and model human emotions for computational systems. Within this field, speech emotion recognition (SER) focuses on predicting emotions conveyed through speech. While early SER systems relied on limited datasets and traditional machine learning models, recent deep learning approaches demand largescale, naturalistic emotional corpora. To address this need, we introduce the MSP-Conversation corpus: a dataset of more than 70 hours of conversational audio with time-continuous emotional annotations and detailed speaker diarizations. The time-continuous annotations capture the dynamic and contextdependent nature of emotional expression. The annotations in the corpus include fine-grained temporal traces of valence, arousal, and dominance. The audio data is sourced from publicly available podcasts and overlaps with a subset of the isolated speaking turns in the MSP-Podcast corpus to facilitate direct comparisons between annotation methods (i.e., in-context versus out-of-context annotations). The paper outlines the development of the corpus, annotation methodology, analyses of the annotations, and baseline SER experiments, establishing the MSP-Conversation corpus as a valuable resource for advancing research in dynamic SER in naturalistic settings.

LGFeb 13
ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning

Esther Sun, Bo-Hao Su, Abinay Reddy Naini et al.

Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.

ASMay 24, 2025Code
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset

Rui Liu, Pu Gao, Jiatian Xi et al.

Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods, mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we pioneer the benchmarking Emotion Correction Dataset for TSE (ECD-TSE). The prominent aspect of ECD-TSE is its inclusion of $<$text, speech$>$ paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.

MMAug 6, 2019Code
Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps

Shih-Fu Chang, Alex Hauptmann, Louis-Philippe Morency et al.

With the transformative technologies and the rapidly changing global R&D landscape, the multimedia and multimodal community is now faced with many new opportunities and uncertainties. With the open source dissemination platform and pervasive computing resources, new research results are being discovered at an unprecedented pace. In addition, the rapid exchange and influence of ideas across traditional discipline boundaries have made the emphasis on multimedia multimodal research even more important than before. To seize these opportunities and respond to the challenges, we have organized a workshop to specifically address and brainstorm the challenges, opportunities, and research roadmaps for MM research. The two-day workshop, held on March 30 and 31, 2017 in Washington DC, was sponsored by the Information and Intelligent Systems Division of the National Science Foundation of the United States. Twenty-three (23) invited participants were asked to review and identify research areas in the MM field that are most important over the next 10-15 year timeframe. Important topics were selected through discussion and consensus, and then discussed in depth in breakout groups. Breakout groups reported initial discussion results to the whole group, who continued with further extensive deliberation. For each identified topic, a summary was produced after the workshop to describe the main findings, including the state of the art, challenges, and research roadmaps planned for the next 5, 10, and 15 years in the identified area.

ASMay 29, 2025
Can Emotion Fool Anti-spoofing?

Aurosweta Mahapatra, Ismail Rasim Ulgen, Abinay Reddy Naini et al.

Traditional anti-spoofing focuses on models and datasets built on synthetic speech with mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows existing anti-spoofing models struggle with emotional synthetic speech, exposing risks of emotion-targeted attacks. Even trained on emotional data, the models underperform due to limited focus on emotional aspect and show performance disparities across emotions. This highlights the need for emotion-focused anti-spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and neutral state, improving defenses against spoofing attacks. We release the EmoSpoof-TTS Dataset: https://emospoof-tts.github.io/Dataset/

LGMay 29, 2025
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast

Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu et al.

Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.

SDDec 27, 2024
Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition

Shreya G. Upadhyay, Ali N. Salman, Carlos Busso et al.

Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis on the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research leverages the CREMA-D and MSP-IMPROV corpora as benchmarks and it reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight mouth articulatory gesture potential as a better constraint for improving emotion recognition across different settings or domains.

SDMar 21, 2024
emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

Thejan Rajapakshe, Rajib Rana, Sara Khalifa et al.

Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.

ASSep 10, 2025
Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition

Jing-Tong Tzeng, Carlos Busso, Chi-Chun Lee

Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks. Under the most challenging condition of -5 dB signal-to-noise ratio (SNR), Sparse MERIT improves SER F1-macro by an average of 12.0% over a baseline relying on a SE pre-processing strategy, and by 3.4% over a naive MTL baseline, with statistical significance on unseen noise conditions. For SE, Sparse MERIT improves segmental SNR (SSNR) by 28.2% over the SE pre-processing baseline and by 20.0% over the naive MTL baseline. These results demonstrate that Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments.

SDFeb 15
The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents

Ziyang Ma, Ruiyang Xu, Yinghao Ma et al.

Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. Besides, single models are rapidly advancing via reinforcement learning and sophisticated data pipeline. We details the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.

AIJan 25
Sentipolis: Emotion-Aware Agents for Social Simulations

Chiyuan Fu, Lyuhao Chen, Yunze Xiao et al.

LLM agents are increasingly used for social simulation, yet emotion is often treated as a transient cue, causing emotional amnesia and weak long-horizon continuity. We present Sentipolis, a framework for emotionally stateful agents that integrates continuous Pleasure-Arousal-Dominance (PAD) representation, dual-speed emotion dynamics, and emotion--memory coupling. Across thousands of interactions over multiple base models and evaluators, Sentipolis improves emotionally grounded behavior, boosting communication, and emotional continuity. Gains are model-dependent: believability increases for higher-capacity models but can drop for smaller ones, and emotion-awareness can mildly reduce adherence to social norms, reflecting a human-like tension between emotion-driven behavior and rule compliance in social simulation. Network-level diagnostics show reciprocal, moderately clustered, and temporally stable relationship structures, supporting the study of cumulative social dynamics such as alliance formation and gradual relationship change.

SDOct 11, 2025
Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

Chung-Soo Ahn, Rajib Rana, Sunil Sivadas et al.

Although speech emotion recognition (SER) research has been advanced, thanks to deep learning methods, it still suffers from obtaining inputs from large quality-labelled training data. Data augmentation methods have been attempted to mitigate this issue, generative models have shown success among them recently. We propose a data augmentation framework that is aided by cross-modal information transfer and mutual information regularization. Mutual information based metric can serve as an indicator for the quality. Furthermore, we expand this data augmentation scope to multimodal inputs, thanks to mutual information ensureing dependency between modalities. Our framework was tested on three benchmark datasets: IEMOCAP, MSP-IMPROV and MSP-Podcast. The implementation was designed to generate input features that are fed into last layer for emotion classification. Our framework improved the performance of emotion prediction against existing works. Also, we discovered that our framework is able to generate new inputs without any cross-modal information.

SDSep 23, 2025
Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

Karen Rosero, Eunjung Yeo, David R. Mortensen et al. · cmu

We present ChiReSSD, a speech reconstruction framework that preserves children speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.

CLSep 19, 2025
Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee

Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.

LGAug 13, 2025
RankList -- A Listwise Preference Learning Framework for Predicting Subjective Preferences

Abinay Reddy Naini, Fernando Diaz, Carlos Busso

Preference learning has gained significant attention in tasks involving subjective human judgments, such as \emph{speech emotion recognition} (SER) and image aesthetic assessment. While pairwise frameworks such as RankNet offer robust modeling of relative preferences, they are inherently limited to local comparisons and struggle to capture global ranking consistency. To address these limitations, we propose RankList, a novel listwise preference learning framework that generalizes RankNet to structured list-level supervision. Our formulation explicitly models local and non-local ranking constraints within a probabilistic framework. The paper introduces a log-sum-exp approximation to improve training efficiency. We further extend RankList with skip-wise comparisons, enabling progressive exposure to complex list structures and enhancing global ranking fidelity. Extensive experiments demonstrate the superiority of our method across diverse modalities. On benchmark SER datasets (MSP-Podcast, IEMOCAP, BIIC Podcast), RankList achieves consistent improvements in Kendall's Tau and ranking accuracy compared to standard listwise baselines. We also validate our approach on aesthetic image ranking using the Artistic Image Aesthetics dataset, highlighting its broad applicability. Through ablation and cross-domain studies, we show that RankList not only improves in-domain ranking but also generalizes better across datasets. Our framework offers a unified, extensible approach for modeling ordered preferences in subjective learning scenarios.

ASJan 19, 2024
Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

Ismail Rasim Ulgen, Zongyang Du, Carlos Busso et al.

Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach involves the sampling of positive and the negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significant improvement in SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting.

LGMay 12, 2023
Versatile audio-visual learning for emotion recognition

Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin et al.

Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. It is also a challenge to robustly handle missing or partial information while allowing direct switch between regression or classification tasks. This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression or emotion classification tasks. We implement an audio-visual framework that can be trained even when audio and visual paired data is not available for part of the training set (i.e., audio only or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on the CREMA-D, MSP-IMPROV, and CMU-MOSEI corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus.

SDJan 19, 2022
Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech

Kusha Sridhar, Carlos Busso

The prediction of valence from speech is an important, but challenging problem. The externalization of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is a hard problem, especially with deep neural networks (DNNs), since it requires optimizing millions of parameters. This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set. Speech samples from the selected speakers are used to create the adaptation set. This approach leverages transfer learning using pre-trained models, which are adapted with these speech samples. We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches. These methods differ on the use of the adaptation set in the personalization of the valence models. The results demonstrate that a valence prediction model can be efficiently personalized with these unsupervised approaches, leading to relative improvements as high as 13.52%.

CVDec 23, 2020
The Multimodal Driver Monitoring Database: A Naturalistic Corpus to Study Driver Attention

Sumit Jha, Mohamed F. Marzban, Tiancheng Hu et al.

A smart vehicle should be able to monitor the actions and behaviors of the human driver to provide critical warnings or intervene when necessary. Recent advancements in deep learning and computer vision have shown great promise in monitoring human behaviors and activities. While these algorithms work well in a controlled environment, naturalistic driving conditions add new challenges such as illumination variations, occlusions and extreme head poses. A vast amount of in-domain data is required to train models that provide high performance in predicting driving related tasks to effectively monitor driver actions and behaviors. Toward building the required infrastructure, this paper presents the multimodal driver monitoring (MDM) dataset, which was collected with 59 subjects that were recorded performing various tasks. We use the Fi- Cap device that continuously tracks the head movement of the driver using fiducial markers, providing frame-based annotations to train head pose algorithms in naturalistic driving conditions. We ask the driver to look at predetermined gaze locations to obtain accurate correlation between the driver's facial image and visual attention. We also collect data when the driver performs common secondary activities such as navigation using a smart phone and operating the in-car infotainment system. All of the driver's activities are recorded with high definition RGB cameras and time-of-flight depth camera. We also record the controller area network-bus (CAN-Bus), extracting important information. These high quality recordings serve as the ideal resource to train various efficient algorithms for monitoring the driver, providing further advancements in the field of in-vehicle safety systems.

CVDec 23, 2020
Estimation of Driver's Gaze Region from Head Position and Orientation using Probabilistic Confidence Regions

Sumit Jha, Carlos Busso

A smart vehicle should be able to understand human behavior and predict their actions to avoid hazardous situations. Specific traits in human behavior can be automatically predicted, which can help the vehicle make decisions, increasing safety. One of the most important aspects pertaining to the driving task is the driver's visual attention. Predicting the driver's visual attention can help a vehicle understand the awareness state of the driver, providing important contextual information. While estimating the exact gaze direction is difficult in the car environment, a coarse estimation of the visual attention can be obtained by tracking the position and orientation of the head. Since the relation between head pose and gaze direction is not one-to-one, this paper proposes a formulation based on probabilistic models to create salient regions describing the visual attention of the driver. The area of the predicted region is small when the model has high confidence on the prediction, which is directly learned from the data. We use Gaussian process regression (GPR) to implement the framework, comparing the performance with different regression formulations such as linear regression and neural network based methods. We evaluate these frameworks by studying the tradeoff between spatial resolution and accuracy of the probability map using naturalistic recordings collected with the UTDrive platform. We observe that the GPR method produces the best result creating accurate predictions with localized salient regions. For example, the 95% confidence region is defined by an area that covers 3.77% region of a sphere surrounding the driver.

HCSep 1, 2019
The Ambiguous World of Emotion Representation

Vidhyasaharan Sethu, Emily Mower Provost, Julien Epps et al.

Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of activities, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that is not true in affective computing: the ground truth information that is inferred can be unambiguously represented. This observation provides some hints as to why affective computing, despite having attracted the attention of researchers for years, may not still be considered a mature field of research. A key reason for this is the lack of a common mathematical framework to describe all the relevant elements of emotion representations. This paper proposes the AMBiguous Emotion Representation (AMBER) framework to address this deficiency. AMBER is a unified framework that explicitly describes categorical, numerical and ordinal representations of emotions, including time varying representations. In addition to explaining the core elements of AMBER, the paper also discusses how some of the commonly employed emotion representation schemes can be viewed through the AMBER framework, and concludes with a discussion of how the proposed framework can be used to reason about current and future affective computing systems.

CLSep 12, 2018
End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Fei Tao, Carlos Busso

Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea proposing a \emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach models the temporal dynamic of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements up to 1.2% under practical scenarios over a VAD baseline using only audio implemented with deep neural network (DNN). The proposed approach achieves 92.7% F1-score when it is evaluated using the sensors from a portable tablet under noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).

HCJun 1, 2018
Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks

Najmeh Sadoughi, Carlos Busso

Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion and lexical content in a principled manner. This model uses a set of articulatory and emotional features directly extracted from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in terms of objective and subjective evaluations. When the target emotion is known, we propose to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted), or adding emotional conditions as the input of the model (CSG-Emo-Aware). Objective evaluations of these models show improvements for the CSG-Emo-Adapted compared with the CSG model, as the trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.

ASMay 25, 2018
Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels

Reza Lotfian, Carlos Busso

This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually presented in increasing level of difficulty. For speech emotion recognition, the challenge is to establish a natural order of difficulty in the training set to create the curriculum. We address this problem by assuming that ambiguous samples for humans are also ambiguous for computers. Speech samples are often annotated by multiple evaluators to account for differences in emotion perception across individuals. While some sentences with clear emotional content are consistently annotated, sentences with more ambiguous emotional content present important disagreement between individual evaluations. We propose to use the disagreement between evaluators as a measure of difficulty for the classification task. We propose metrics that quantify the inter-evaluation agreement to define the curriculum for regression problems and binary and multi-class classification problems. The experimental results consistently show that relying on a curriculum based on agreement between human judgments leads to statistically significant improvements over baselines trained without a curriculum.

ASApr 28, 2018
Ladder Networks for Emotion Recognition: Using Unsupervised Auxiliary Tasks to Improve Predictions of Emotional Attributes

Srinivas Parthasarathy, Carlos Busso

Recognizing emotions using few attribute dimensions such as arousal, valence and dominance provides the flexibility to effectively represent complex range of emotional behaviors. Conventional methods to learn these emotional descriptors primarily focus on separate models to recognize each of these attributes. Recent work has shown that learning these attributes together regularizes the models, leading to better feature representations. This study explores new forms of regularization by adding unsupervised auxiliary tasks to reconstruct hidden layer representations. This auxiliary task requires the denoising of hidden representations at every layer of an auto-encoder. The framework relies on ladder networks that utilize skip connections between encoder and decoder layers to learn powerful representations of emotional dimensions. The results show that ladder networks improve the performance of the system compared to baselines that individually learn each attribute, and conventional denoising autoencoders. Furthermore, the unsupervised auxiliary tasks have promising potential to be used in a semi-supervised setting, where few labeled sentences are available.

ASApr 20, 2018
Domain Adversarial for Acoustic Emotion Recognition

Mohammed Abdelwahab, Carlos Busso

The performance of speech emotion recognition is affected by the differences in data distributions between train (source domain) and test (target domain) sets used to build and evaluate the models. This is a common problem, as multiple studies have shown that the performance of emotional classifiers drop when they are exposed to data that does not match the distribution used to build the emotion classifiers. The difference in data distributions becomes very clear when the training and testing data come from different domains, causing a large performance gap between validation and testing performance. Due to the high cost of annotating new data and the abundance of unlabeled data, it is crucial to extract as much useful information as possible from the available unlabeled data. This study looks into the use of adversarial multitask training to extract a common representation between train and test domains. The primary task is to predict emotional attribute-based descriptors for arousal, valence, or dominance. The secondary task is to learn a common representation where the train and test domains cannot be distinguished. By using a gradient reversal layer, the gradients coming from the domain classifier are used to bring the source and target domain representations closer. We show that exploiting unlabeled data consistently leads to better emotion recognition performance across all emotional dimensions. We visualize the effect of adversarial training on the feature representation across the proposed deep learning architecture. The analysis shows that the data representations for the train and test domains converge as the data is passed to deeper layers of the network. We also evaluate the difference in performance when we use a shallow neural network versus a \emph{deep neural network} (DNN) and the effect of the number of shared layers used by the task and domain classifiers.

HCAug 4, 2017
Speech-driven Animation with Meaningful Behaviors

Najmeh Sadoughi, Carlos Busso

Conversational agents (CAs) play an important role in human computer interaction. Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Studies in the past have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors conveying the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures. However, they create behaviors disregarding the meaning of the message. This study proposes to bridge the gap between these two approaches overcoming their limitations. The approach builds a dynamic Bayesian network (DBN), where a discrete variable is added to constrain the behaviors on the underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on the discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class learning the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer creating trajectories that are timely synchronized with speech. The study proposes a DBN structure and a training approach that (1) models the cause-effect relationship between the constraint and the gestures, (2) initializes the state configuration models increasing the range of the generated behaviors, and (3) captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model.