SP Apr 12, 2022
GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion Recognition
Yang Li, Ji Chen, Fu Li et al.
Previous electroencephalogram (EEG) emotion recognition relies on single-task learning, which may lead to overfitting and learned emotion features lacking generalization. In this paper, a graph-based multi-task self-supervised learning model (GMSS) for EEG emotion recognition is proposed. GMSS has the ability to learn more general representations by integrating multiple self-supervised tasks, including spatial and frequency jigsaw puzzle tasks, and contrastive learning tasks. By learning from multiple tasks simultaneously, GMSS can find a representation that captures all of the tasks thereby decreasing the chance of overfitting on the original task, i.e., emotion recognition task. In particular, the spatial jigsaw puzzle task aims to capture the intrinsic spatial relationships of different brain regions. Considering the importance of frequency information in EEG emotional signals, the goal of the frequency jigsaw puzzle task is to explore the crucial frequency bands for EEG emotion recognition. To further regularize the learned features and encourage the network to learn inherent representations, contrastive learning task is adopted in this work by mapping the transformed data into a common feature space. The performance of the proposed GMSS is compared with several popular unsupervised and supervised methods. Experiments on SEED, SEED-IV, and MPED datasets show that the proposed model has remarkable advantages in learning more discriminative and general features for EEG emotional signals.
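As a rough illustration of the frequency jigsaw pretext task described above, the sketch below shuffles the frequency-band axis of one EEG feature matrix with a permutation drawn from a fixed candidate set and uses the index of that permutation as the self-supervised label. The five-band layout, the name `make_frequency_jigsaw`, and the size of the candidate set are assumptions made for illustration; they are not taken from the GMSS paper.

```python
import random
from itertools import permutations

# Assumed layout: one sample is a list of channels, each holding one
# feature per frequency band (for example delta, theta, alpha, beta, gamma).
NUM_BANDS = 5

# Hypothetical candidate set of band orderings. The index into this list is
# the pseudo-label a network would be trained to predict.
CANDIDATE_PERMS = list(permutations(range(NUM_BANDS)))[:8]


def make_frequency_jigsaw(sample, rng):
    """Shuffle the band axis of `sample`; return (shuffled_sample, label)."""
    label = rng.randrange(len(CANDIDATE_PERMS))
    perm = CANDIDATE_PERMS[label]
    # Apply the same band permutation to every channel.
    shuffled = [[channel[b] for b in perm] for channel in sample]
    return shuffled, label


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [[ch * 10 + b for b in range(NUM_BANDS)] for ch in range(3)]
    shuffled, label = make_frequency_jigsaw(sample, rng)
    print(label, shuffled)
```

A spatial jigsaw task could presumably be built the same way by permuting groups of channels (brain regions) instead of bands.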
SD Oct 22, 2022
Speech Emotion Recognition via an Attentive Time-Frequency Neural Network
Cheng Lu, Wenming Zheng, Hailun Lian et al.
Spectrograms are commonly used as the input features of deep neural networks to learn the higher-level time-frequency patterns of speech signals for speech emotion recognition (SER). Generally, different emotions correspond to specific energy activations both within frequency bands and time frames on the spectrogram, which indicates that the frequency and time domains are both essential to represent emotion for SER. However, recent spectrogram-based works mainly focus on modeling the long-term dependency in the time domain, which leads to two issues: (1) neglecting the emotion-related correlations within the frequency domain during time-frequency joint learning; (2) failing to capture the specific frequency bands associated with emotions. To cope with these issues, we propose an attentive time-frequency neural network (ATFNN) for SER, including a time-frequency neural network (TFNN) and time-frequency attention. Specifically, to address the first issue, we design a TFNN with a frequency-domain encoder (F-Encoder) based on the Transformer encoder and a time-domain encoder (T-Encoder) based on the Bidirectional Long Short-Term Memory (Bi-LSTM). The F-Encoder and T-Encoder model the correlations within frequency bands and time frames, respectively, and they are embedded into a time-frequency joint learning strategy to obtain the time-frequency patterns for speech emotions. Moreover, to handle the second issue, we adopt time-frequency attention with a frequency-attention network (F-Attention) and a time-attention network (T-Attention) to focus on the emotion-related frequency band ranges and time frame ranges, which can enhance the discriminability of speech emotion features.
SD Feb 17, 2023
Deep Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
Yan Zhao, Jincen Wang, Yuan Zong et al.
In this paper, we propose a novel deep transfer learning method called deep implicit distribution alignment networks (DIDAN) to deal with the cross-corpus speech emotion recognition (SER) problem, in which the labeled training (source) and unlabeled testing (target) speech signals come from different corpora. Specifically, DIDAN first adopts a simple deep regression network consisting of a set of convolutional and fully connected layers to directly regress the source speech spectrums into the emotional labels, so that DIDAN acquires emotion discriminative ability. Then, this ability is transferred to the target speech samples regardless of corpus variance by resorting to a well-designed regularization term called implicit distribution alignment (IDA). Unlike the widely-used maximum mean discrepancy (MMD) and its variants, the proposed IDA absorbs the idea of sample reconstruction to implicitly align the distribution gap, which enables DIDAN to learn both emotion discriminative and corpus invariant features from speech spectrums. To evaluate the proposed DIDAN, extensive cross-corpus SER experiments on widely-used speech emotion corpora are carried out. Experimental results show that the proposed DIDAN can outperform many recent state-of-the-art methods in coping with the cross-corpus SER tasks.
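For context on the baseline that IDA is contrasted with, here is a minimal sketch of a plain MMD estimate between source and target feature sets. It uses the standard biased estimator with a Gaussian kernel; it is not the IDA regularizer from the paper, and the function names and the `bandwidth` default are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


if __name__ == "__main__":
    source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    target = [[x + 3.0, y + 3.0] for x, y in source]
    print(mmd_squared(source, source), mmd_squared(source, target))
```

Used as a regularizer, a value like this would be added to the task loss so that source and target features are pushed toward the same distribution.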
SP Aug 9, 2023
EEG-based Emotion Style Transfer Network for Cross-dataset Emotion Recognition
Yijin Zhou, Fu Li, Yang Li et al.
As the key to realizing aBCIs, EEG emotion recognition has been widely studied by many researchers. Previous methods have performed well for intra-subject EEG emotion recognition. However, the style mismatch between source domain (training data) and target domain (test data) EEG samples caused by huge inter-domain differences is still a critical problem for EEG emotion recognition. To solve the problem of cross-dataset EEG emotion recognition, in this paper, we propose an EEG-based Emotion Style Transfer Network (E2STN) to obtain EEG representations that contain the content information of source domain and the style information of target domain, which is called stylized emotional EEG representations. The representations are helpful for cross-dataset discriminative prediction. Concretely, E2STN consists of three modules, i.e., transfer module, transfer evaluation module, and discriminative prediction module. The transfer module encodes the domain-specific information of source and target domains and then re-constructs the source domain's emotional pattern and the target domain's statistical characteristics into the new stylized EEG representations. In this process, the transfer evaluation module is adopted to constrain the generated representations that can more precisely fuse two kinds of complementary information from source and target domains and avoid distorting. Finally, the generated stylized EEG representations are fed into the discriminative prediction module for final classification. Extensive experiments show that the E2STN can achieve the state-of-the-art performance on cross-dataset EEG emotion recognition tasks.
CV Oct 7, 2023
Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition
Jie Zhu, Yuan Zong, Jingang Shi et al.
This paper focuses on the research of micro-expression recognition (MER) and proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O). The LTR3O method introduces a dynamic and reduced-size sequence structure known as 3O, which consists of onset, occurring, and offset frames, for representing micro-expressions (MEs). This structure facilitates the subsequent learning of ME-discriminative features. A noteworthy advantage of the 3O structure is its flexibility, as the occurring frame is randomly extracted from the original ME sequence without the need for accurate frame spotting methods. Based on the 3O structures, LTR3O generates multiple 3O representation candidates for each ME sample and incorporates well-designed modules to measure and calibrate their emotional expressiveness. This calibration process ensures that the distribution of these candidates aligns with that of macro-expressions (MaMs) over time. Consequently, the visibility of MEs can be implicitly enhanced, facilitating the reliable learning of more discriminative features for MER. Extensive experiments were conducted to evaluate the performance of LTR3O using three widely-used ME databases: CASME II, SMIC, and SAMM. The experimental results demonstrate the effectiveness and superior performance of LTR3O, particularly in terms of its flexibility and reliability, when compared to recent state-of-the-art MER methods.
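The 3O structure described above can be sketched as follows: keep the first and last frames of a micro-expression clip and draw the middle "occurring" frame at random, then repeat to obtain several candidates. Treating the first and last frames of the clip as onset and offset, as well as the names `sample_3o` and `sample_3o_candidates`, are assumptions for illustration; the measuring and calibration modules of LTR3O are not modeled here.

```python
import random


def sample_3o(sequence, rng):
    """Return an (onset, occurring, offset) triple from one ME clip.

    Assumes sequence[0] is the onset frame and sequence[-1] the offset frame.
    """
    if len(sequence) < 3:
        raise ValueError("need at least three frames")
    # The occurring frame is drawn uniformly from the interior frames,
    # so no apex-spotting step is required.
    occurring_idx = rng.randrange(1, len(sequence) - 1)
    return sequence[0], sequence[occurring_idx], sequence[-1]


def sample_3o_candidates(sequence, num_candidates, rng):
    """Generate several 3O candidates for the same clip."""
    return [sample_3o(sequence, rng) for _ in range(num_candidates)]


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(10)]
    for triple in sample_3o_candidates(frames, 3, random.Random(0)):
        print(triple)
```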
CV Oct 6, 2023
Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning
Qing Zhu, Qirong Mao, Jialin Zhang et al.
Group-level emotion recognition (GER) is an inseparable part of human behavior analysis, aiming to recognize an overall emotion in a multi-person scene. However, the existing methods are devoted to combining diverse emotion cues while ignoring the inherent uncertainties under unconstrained environments, such as congestion and occlusion occurring within a group. Additionally, since only group-level labels are available, inconsistent emotion predictions among individuals in one group can confuse the network. In this paper, we propose an uncertainty-aware learning (UAL) method to extract more robust representations for GER. By explicitly modeling the uncertainty of each individual, we utilize a stochastic embedding drawn from a Gaussian distribution instead of a deterministic point embedding. This representation captures the probabilities of different emotions and generates diverse predictions through this stochasticity during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' faces within each group. Moreover, we develop an image enhancement module to enhance the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene components, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.
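To make the idea of stochastic embeddings and uncertainty-sensitive fusion concrete, the sketch below samples each face embedding from a Gaussian via the reparameterization z = mean + sigma * eps and fuses faces with weights that favor low-variance individuals. The specific weighting rule (a softmax over negative total variance) and all function names are assumptions for illustration; the abstract does not specify how UAL computes its scores.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Draw z = mean + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def uncertainty_weights(log_vars):
    """Assumed rule: softmax over negative total variance per face,
    so less uncertain faces receive larger fusion weights."""
    scores = [-sum(math.exp(lv) for lv in log_var) for log_var in log_vars]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, weights):
    """Weighted sum of per-face embeddings into one group embedding."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    means = [[1.0, 0.0], [0.0, 1.0]]
    log_vars = [[-2.0, -2.0], [1.0, 1.0]]  # second face is more uncertain
    samples = [sample_embedding(m, lv, rng) for m, lv in zip(means, log_vars)]
    print(fuse(samples, uncertainty_weights(log_vars)))
```

Drawing several samples per face at inference time and averaging the resulting predictions would be one way to obtain the diverse predictions the abstract mentions.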
CV Sep 18, 2022
SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for Spotting Dynamic Facial Expressions in Long Videos
Xiaolin Xu, Yuan Zong, Wenming Zheng et al.
In this paper, we present a large-scale, multi-source, and unconstrained database called SDFE-LV for spotting the onset and offset frames of a complete dynamic facial expression from long videos, which is known as the topic of dynamic facial expression spotting (DFES) and a vital prior step for many facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long videos, each of which contains one or more complete dynamic facial expressions. Moreover, each complete dynamic facial expression in its corresponding long video was independently labeled five times by 10 well-trained annotators. To the best of our knowledge, SDFE-LV is the first unconstrained large-scale database for the DFES task whose long videos are collected from multiple real-world/closely real-world media sources, e.g., TV interviews, documentaries, movies, and we-media short videos. Therefore, DFES tasks on the SDFE-LV database will encounter numerous difficulties in practice such as head posture changes, occlusions, and illumination. We also provide a comprehensive benchmark evaluation from different angles using many recent state-of-the-art deep spotting methods, so that researchers interested in DFES can quickly and easily get started. Finally, based on in-depth discussion of the experimental evaluation results, we attempt to point out several meaningful directions for dealing with DFES tasks and hope that DFES can be better advanced in the future. In addition, SDFE-LV will be freely released for academic use only as soon as possible.
CV Jul 17, 2024
Temporal Label Hierachical Network for Compound Emotion Recognition
Sunan Li, Hailun Lian, Cheng Lu et al.
Emotion recognition has attracted increasing attention in recent decades. Although significant progress has been made in recognizing the seven basic emotions, existing methods still struggle with compound emotion recognition, which occurs commonly in practical applications. This article introduces our achievements in the 7th Affective Behavior Analysis in-the-wild (ABAW) competition. In the competition, we selected pre-trained ResNet18 and Transformer, which have been widely validated, as the basic network framework. Considering the continuity of emotions over time, we propose a time pyramid structure network for frame-level emotion prediction. Furthermore, to address the lack of data in compound emotion recognition, we utilized fine-grained labels from the DFEW database to construct training data for the emotion categories in the competition. Taking into account the valence-arousal characteristics of various complex emotions, we constructed a coarse-to-fine classification framework in the label space.
CV Oct 16, 2023
An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition
Ling Zhou, Mingpei Wang, Xiaohua Huang et al.
Micro-expression recognition (MER) in low-resolution (LR) scenarios presents an important and complex challenge, particularly for practical applications such as group MER in crowded environments. Despite considerable advancements in super-resolution techniques for enhancing the quality of LR images and videos, few studies have investigated super-resolution for improving LR MER. The scarcity of investigation can be attributed to the inherent difficulty in capturing the subtle motions of micro-expressions, even in original-resolution MER samples, which becomes even more challenging in LR samples due to the loss of distinctive features. Furthermore, a lack of systematic benchmarking and thorough analysis of super-resolution-assisted MER methods has been noted. This paper tackles these issues by conducting a series of benchmark experiments that integrate both super-resolution (SR) and MER methods, guided by an in-depth literature survey. Specifically, we employ seven state-of-the-art (SOTA) MER techniques and evaluate their performance on samples generated from 13 SOTA SR techniques, thereby addressing the problem of super-resolution in MER. Through our empirical study, we uncover the primary challenges associated with SR-assisted MER and identify avenues to tackle these challenges by leveraging recent advancements in both SR and MER methodologies. Our analysis provides insights for progressing toward more efficient SR-assisted MER.
CV Aug 13, 2024
A Survey of Deep Learning for Group-level Emotion Recognition
Xiaohua Huang, Jinke Xu, Wenming Zheng et al.
With the advancement of artificial intelligence (AI) technology, group-level emotion recognition (GER) has emerged as an important area in analyzing human behavior. Early GER methods primarily relied on handcrafted features. However, with the proliferation of Deep Learning (DL) techniques and their remarkable success in diverse tasks, neural networks have garnered increasing interest in GER. Unlike an individual's emotion, group emotions exhibit diversity and dynamics. Presently, several DL approaches have been proposed to effectively leverage the rich information inherent in group-level images and enhance GER performance significantly. In this survey, we present a comprehensive review of DL techniques applied to GER, proposing a new taxonomy for the field that covers all aspects of GER based on DL. The survey overviews datasets, the deep GER pipeline, and performance comparisons of the state-of-the-art methods of the past decade. Moreover, it summarizes and discusses the fundamental approaches and advanced developments for each aspect. Furthermore, we identify outstanding challenges and suggest potential avenues for the design of robust GER systems. To the best of our knowledge, this survey represents the first comprehensive review of deep GER methods, serving as a pivotal reference for future GER research endeavors.
SP May 8
A Hybrid Graph Neural Network for Enhanced EEG-Based Depression Detection
Yiye Wang, Wenming Zheng, Yang Li et al.
Graph neural networks (GNNs) are becoming increasingly popular for EEG-based depression detection. However, previous GNN-based methods fail to sufficiently consider the characteristics of depression, thus limiting their performance. Firstly, studies in neuroscience indicate that depression patients exhibit both common and individualized abnormal brain patterns. Previous GNN-based approaches typically focus either on fixed graph connections to capture common abnormal brain patterns or on adaptive connections to capture individualized patterns, which is inadequate for depression detection. Secondly, the brain network exhibits a hierarchical structure, which includes the arrangement from the channel-level graph to the region-level graph. This hierarchical structure varies among individuals and contains significant information relevant to detecting depression. Nonetheless, previous GNN-based methods overlook this individualized hierarchical information. To address these issues, we propose a Hybrid GNN (HGNN) that merges a Common Graph Neural Network (CGNN) branch utilizing fixed connections and an Individualized Graph Neural Network (IGNN) branch employing adaptive connections. The two branches capture common and individualized depression patterns respectively, complementing each other. Furthermore, we enhance the IGNN branch with a Graph Pooling and Unpooling Module (GPUM) to extract individualized hierarchical information. Extensive experiments on two public datasets show that our model achieves state-of-the-art performance.
CV Jul 28, 2025
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun, Xingxun Jiang, Haoyu Chen et al.
Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
LG Dec 28, 2025
Multimodal Functional Maximum Correlation for Emotion Recognition
Deyang Zheng, Tianyi Zhang, Wenming Zheng et al.
Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
CL May 4
PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention
Maoheng Li, Ling Zhou, Xiaohua Huang et al.
Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on naïve similarity-based attention mechanisms and uniform late fusion strategies. Furthermore, given that functional entanglement restricts traditional late fusions, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-granularity evidence. Extensive experiments on the MUStARD benchmark and its spurious-correlation-mitigated balanced datasets demonstrate that our approach achieves new state-of-the-art performance, surpassing the strongest multimodal baseline by a substantial 3.14% improvement in Macro-F1. By architecturally isolating atomic, composition, and contextual conflicts, this work provides a robust, decoupled paradigm for modeling subtle pragmatic incongruities in human communication.
HC Mar 9, 2024
Computational Analysis of Stress, Depression and Engagement in Mental Health: A Survey
Puneet Kumar, Alexander Vedernikov, Yuwei Chen et al.
Analysis of stress, depression and engagement is less common and more complex than that of frequently discussed emotions such as happiness, sadness, fear and anger. The importance of these psychological states has been increasingly recognized due to their implications for mental health and well-being. Stress and depression are interrelated and together they impact engagement in daily tasks, highlighting the need to explore their interplay. This survey is the first to simultaneously explore computational methods for analyzing stress, depression and engagement. We present a taxonomy and timeline of the computational approaches used to analyze them and we discuss the most commonly used datasets and input modalities, along with the categories and generic pipeline of these approaches. Subsequently, we describe state-of-the-art computational approaches, including a performance summary on the most commonly used datasets. Following this, we explore the applications of stress, depression and engagement analysis, along with the associated challenges, limitations and future research directions.
CV Mar 12, 2025
Decoupled Doubly Contrastive Learning for Cross Domain Facial Action Unit Detection
Yong Li, Menglin Liu, Zhen Cui et al.
Despite the impressive performance of current vision-based facial action unit (AU) detection approaches, they are heavily susceptible to the variations across different domains and the cross-domain AU detection methods are under-explored. In response to this challenge, we propose a decoupled doubly contrastive adaptation (D$^2$CA) approach to learn a purified AU representation that is semantically aligned for the source and target domains. Specifically, we decompose latent representations into AU-relevant and AU-irrelevant components, with the objective of exclusively facilitating adaptation within the AU-relevant subspace. To achieve the feature decoupling, D$^2$CA is trained to disentangle AU and domain factors by assessing the quality of synthesized faces in cross-domain scenarios when either AU or domain attributes are modified. To further strengthen feature decoupling, particularly in scenarios with limited AU data diversity, D$^2$CA employs a doubly contrastive learning mechanism comprising image and feature-level contrastive learning to ensure the quality of synthesized faces and mitigate feature ambiguities. This new framework leads to an automatically learned, dedicated separation of AU-relevant and domain-relevant factors, and it enables intuitive, scale-specific control of the cross-domain facial image synthesis. Extensive experiments demonstrate the efficacy of D$^2$CA in successfully decoupling AU and domain factors, yielding visually pleasing cross-domain synthesized facial images. Meanwhile, D$^2$CA consistently outperforms state-of-the-art cross-domain AU detection approaches, achieving an average F1 score improvement of 6%-14% across various cross-domain scenarios.
AS Apr 5
AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
Tianhua Qi, Wenming Zheng, Björn W. Schuller et al.
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
CV Sep 26, 2025
Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition
Qing Zhu, Wangdong Guo, Qirong Mao et al.
Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals. Existing methods underestimate the importance of visual scene contextual information in modeling individual relationships. Furthermore, they overlook the crucial role of semantic information from emotional labels for a complete understanding of emotions. To address these limitations, we propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance. It involves a visual context encoding module that leverages multi-scale scene information to diversely encode individual relationships. Complementarily, the emotion semantic encoding module utilizes group-level emotion labels to prompt a large language model to generate nuanced emotion lexicons. These lexicons, in conjunction with the emotion labels, are then refined into comprehensive semantic representations using a structured emotion tree. Finally, similarity-aware interaction is proposed to align and integrate visual and semantic information, thereby generating enhanced group-level emotion representations and improving the performance of GER. Experiments on three widely adopted GER datasets demonstrate that our proposed method achieves competitive performance compared to state-of-the-art methods.
CV Aug 13, 2025
MPT: Motion Prompt Tuning for Micro-Expression Recognition
Jiateng Liu, Hengcan Shi, Feng Chen et al.
Micro-expression recognition (MER) is crucial in the affective computing field due to its wide application in medical diagnosis, lie detection, and criminal investigation. Despite its significance, obtaining micro-expression (ME) annotations is challenging due to the expertise required from psychological professionals. Consequently, ME datasets often suffer from a scarcity of training samples, severely constraining the learning of MER models. While current large pre-training models (LMs) offer general and discriminative representations, their direct application to MER is hindered by an inability to capture transitory and subtle facial movements, which are essential elements for effective MER. This paper introduces Motion Prompt Tuning (MPT) as a novel approach to adapting LMs for MER, representing a pioneering method for subtle motion prompt tuning. In particular, we introduce motion prompt generation, including motion magnification and Gaussian tokenization, to extract subtle motions as prompts for LMs. Additionally, a group adapter is carefully designed and inserted into the LM to enhance it in the target MER domain, facilitating a more nuanced distinction of ME representation. Furthermore, extensive experiments conducted on three widely used MER datasets demonstrate that our proposed MPT consistently surpasses state-of-the-art approaches and verify its effectiveness.
LG May 14, 2025
Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation
Feifan Wang, Tengfei Song, Minggui He et al.
Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using closed-source VLLM. This approach integrates prior human knowledge to VLLM inference, guided by the inherent correlations between three grained levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.
CV Oct 29, 2024
Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network
Shaokai Li, Yixuan Ji, Peng Song et al.
In this paper, we propose a novel deep inductive transfer learning framework, named feature distribution adaptation network, to tackle the challenging multi-modal speech emotion recognition problem. Our method aims to use deep transfer learning strategies to align visual and audio feature distributions to obtain consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, the pre-trained ResNet-34 is utilized for feature extraction for facial expression images and acoustic Mel spectrograms, respectively. Then, the cross-attention mechanism is introduced to model the intrinsic similarity relationships of multi-modal features. Finally, the multi-modal feature distribution adaptation is performed efficiently with feed-forward network, which is extended using the local maximum mean discrepancy loss. Experiments are carried out on two benchmark datasets, and the results demonstrate that our model can achieve excellent performance compared with existing ones.
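The distribution alignment idea can be illustrated with a plain maximum mean discrepancy (MMD) estimate. This is only a minimal sketch: it computes the global, biased squared MMD with a Gaussian kernel, not the local (class-weighted) variant named in the abstract, and the names `gaussian_kernel`, `mmd2`, the bandwidth `sigma`, and the synthetic `audio`/`visual_*` features are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
    sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=4.0):
    """Biased estimate of the squared MMD between two feature sets."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())


# Synthetic stand-ins for audio and visual feature batches (hypothetical data).
rng = np.random.default_rng(0)
audio = rng.normal(0.0, 1.0, size=(64, 8))
visual_close = rng.normal(0.0, 1.0, size=(64, 8))  # same distribution as audio
visual_far = rng.normal(2.0, 1.0, size=(64, 8))    # shifted distribution

mmd_close = mmd2(audio, visual_close)
mmd_far = mmd2(audio, visual_far)
```

Used as a loss term, a smaller value indicates that the two feature distributions are better aligned; here `mmd_far` should come out clearly larger than `mmd_close`.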
CL Jan 19, 2024
Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition
Yong Wang, Cheng Lu, Hailun Lian et al.
Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows to aggregate multi-scale emotion features for speech emotion recognition (SER), called Speech Swin-Transformer. Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is utilized to explore local inter-frame emotional information across frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of Transformer from frame-level to segment-level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
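The windowing step can be sketched on a toy spectrogram. This is an illustrative sketch only: the cyclic shift via `np.roll` is borrowed from the original vision Swin-Transformer and may differ from the paper's exact scheme, and `partition_time_windows`, `window`, and `shift` are hypothetical names.

```python
import numpy as np


def partition_time_windows(spec, window, shift=0):
    """Split a (n_freq, n_frames) spectrogram into windows along the time axis.

    With shift > 0 the frames are cyclically shifted first, so that each
    window straddles a boundary between two of the unshifted windows.
    Returns an array of shape (n_windows, n_freq, window).
    """
    n_freq, n_frames = spec.shape
    if n_frames % window != 0:
        raise ValueError("n_frames must be divisible by window")
    shifted = np.roll(spec, -shift, axis=1)
    return shifted.reshape(n_freq, n_frames // window, window).transpose(1, 0, 2)


# Toy spectrogram: 2 frequency bins, 8 frames, values equal to their flat index.
spec = np.arange(2 * 8, dtype=float).reshape(2, 8)
regular = partition_time_windows(spec, window=4)
shifted = partition_time_windows(spec, window=4, shift=2)
```

In the regular partition the first window covers frames 0 to 3, while in the shifted partition it covers frames 2 to 5, so attention inside that window can relate frames on both sides of the original boundary.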
SD Jan 18, 2024
Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation
Cheng Lu, Yuan Zong, Hailun Lian et al.
In speaker-independent speech emotion recognition, the training and testing samples are collected from diverse speakers, leading to a multi-domain shift challenge across the feature distributions of data from different speakers. Consequently, when the trained model is confronted with data from new speakers, its performance tends to degrade. To address the issue, we propose a Dynamic Joint Distribution Adaptation (DJDA) method under the framework of multi-source domain adaptation. DJDA first utilizes joint distribution adaptation (JDA), involving marginal distribution adaptation (MDA) and conditional distribution adaptation (CDA), to more precisely measure the multi-domain distribution shifts caused by different speakers. This helps eliminate speaker bias in emotion features, allowing for learning discriminative and speaker-invariant speech emotion features from coarse-level to fine-level. Furthermore, we quantify the adaptation contributions of MDA and CDA within JDA by using a dynamic balance factor based on $\mathcal{A}$-Distance, which helps to effectively handle the unknown distributions encountered in data from new speakers. Experimental results demonstrate the superior performance of our DJDA as compared to other state-of-the-art (SOTA) methods.
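One common way to turn $\mathcal{A}$-distance into a balance factor can be sketched as follows. The abstract does not state DJDA's formula, so this is an assumption: it uses the widely used proxy $\mathcal{A}$-distance $2(1 - 2\epsilon)$, where $\epsilon$ is the error of a classifier trained to tell two domains apart, and a balance rule of the form used in earlier dynamic distribution adaptation work. The function names and the fallback value 0.5 are hypothetical.

```python
def proxy_a_distance(domain_error):
    """Proxy A-distance from the error of a domain classifier.

    An error of 0.5 (chance) gives distance 0; an error of 0 gives distance 2.
    """
    return 2.0 * (1.0 - 2.0 * domain_error)


def balance_factor(marginal_error, conditional_errors):
    """Weight on the conditional term; (1 - weight) goes to the marginal term.

    Assumed rule: the larger the per-class (conditional) distances are
    relative to the marginal distance, the more weight the conditional
    adaptation receives.
    """
    d_marginal = max(proxy_a_distance(marginal_error), 0.0)
    d_conditional = sum(max(proxy_a_distance(e), 0.0) for e in conditional_errors)
    total = d_marginal + d_conditional
    if total == 0.0:
        return 0.5  # no measurable shift: split the weight evenly (assumption)
    return 1.0 - d_marginal / total


# Marginals already aligned, classes not: all weight goes to the conditional term.
mu_conditional_only = balance_factor(0.5, [0.0, 0.0])
# Classes aligned, marginals not: all weight goes to the marginal term.
mu_marginal_only = balance_factor(0.0, [0.5, 0.5])
```

Recomputing the factor during training is what makes it dynamic in this sketch: the weighting follows whichever shift currently dominates.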
CV Dec 14, 2021
Progressive Graph Convolution Network for EEG Emotion Recognition
Yijin Zhou, Fu Li, Yang Li et al.
Studies in the area of neuroscience have revealed the relationship between emotional patterns and brain functional regions, demonstrating that dynamic relationships between different brain regions are an essential factor affecting emotion recognition determined through electroencephalography (EEG). Moreover, in EEG emotion recognition, we can observe that clearer boundaries exist between coarse-grained emotions than those between fine-grained emotions, based on the same EEG data; this indicates the concurrence of large coarse- and small fine-grained emotion variations. Thus, the progressive classification process from coarse- to fine-grained categories may be helpful for EEG emotion recognition. Consequently, in this study, we propose a progressive graph convolution network (PGCN) for capturing this inherent characteristic in EEG emotional signals and progressively learning the discriminative EEG features. To fit different EEG patterns, we constructed a dual-graph module to characterize the intrinsic relationship between different EEG channels, containing the dynamic functional connections and static spatial proximity information of brain regions from neuroscience research. Moreover, motivated by the observation of the relationship between coarse- and fine-grained emotions, we adopt a dual-head module that enables the PGCN to progressively learn more discriminative EEG features, from coarse-grained (easy) to fine-grained categories (difficult), referring to the hierarchical characteristic of emotion. To verify the performance of our model, extensive experiments were conducted on two public datasets: SEED-IV and multi-modal physiological emotion database (MPED).
CV Nov 30, 2021
Seeking Salient Facial Regions for Cross-Database Micro-Expression Recognition
Xingxun Jiang, Yuan Zong, Wenming Zheng et al.
Cross-Database Micro-Expression Recognition (CDMER) aims to develop Micro-Expression Recognition (MER) methods with strong domain adaptability, i.e., the ability to recognize the Micro-Expressions (MEs) of different subjects captured by different imaging devices in different scenes. The development of CDMER faces two key problems: 1) the severe feature distribution gap between the source and target databases; 2) the feature representation bottleneck of MEs as local and subtle facial expressions. To solve these problems, this paper proposes a novel Transfer Group Sparse Regression method, namely TGSR, which aims to 1) optimize the measurement and better alleviate the difference between the source and target databases, and 2) highlight the valid facial regions to enhance the extracted features by selecting group features from the raw face feature, where each region is associated with a group of raw face features, i.e., salient facial region selection. Compared with previous transfer group sparse methods, our proposed TGSR has the ability to select the salient facial regions, which is effective in alleviating the aforementioned problems for better performance and reducing the computational cost at the same time. We use two public ME databases, i.e., CASME II and SMIC, to evaluate our proposed TGSR method. Experimental results show that our proposed TGSR learns the discriminative and explicable regions, and outperforms most state-of-the-art subspace-learning-based domain-adaptive methods for CDMER.
CV Jul 13, 2021
Region attention and graph embedding network for occlusion objective class-based micro-expression recognition
Qirong Mao, Ling Zhou, Wenming Zheng et al.
Micro-expression recognition (MER) has attracted a lot of researchers' attention over the past decade. However, occlusion will occur for MER in real-world scenarios. This paper deeply investigates an interesting but unexplored challenging issue in MER, i.e., occlusion MER. First, to research MER under real-world occlusion, synthetic occluded micro-expression databases are created for the community by using various masks. Second, to suppress the influence of occlusion, a Region-inspired Relation Reasoning Network (RRRN) is proposed to model relations between various facial regions. RRRN consists of a backbone network, the Region-Inspired (RI) module and the Relation Reasoning (RR) module. More specifically, the backbone network aims at extracting feature representations from different facial regions, the RI module computes an adaptive weight from the region itself based on an attention mechanism with respect to the unobstructedness and importance for suppressing the influence of occlusion, and the RR module exploits the progressive interactions among these regions by performing graph convolutions. Experiments are conducted on the holdout-database evaluation and composite database evaluation tasks of the MEGC 2018 protocol. Experimental results show that RRRN can significantly explore the importance of facial regions and capture the cooperative complementary relationship of facial regions for MER. The results also demonstrate that RRRN outperforms the state-of-the-art approaches, especially on occlusion, and that RRRN is more robust to occlusion.
CV Oct 19, 2020
SMA-STN: Segmented Movement-Attending Spatiotemporal Network for Micro-Expression Recognition
Jiateng Liu, Wenming Zheng, Yuan Zong
Correctly perceiving micro-expressions is difficult since a micro-expression is an involuntary, repressed, and subtle facial expression, and efficiently revealing the subtle movement changes and capturing the significant segments in a micro-expression sequence is the key to micro-expression recognition (MER). To handle this crucial issue, in this paper, we first propose a dynamic segmented sparse imaging module (DSSI) to compute dynamic images as local-global spatiotemporal descriptors under a unique sampling protocol, which reveals the subtle movement changes visually in an efficient way. Second, a segmented movement-attending spatiotemporal network (SMA-STN) is proposed to further unveil imperceptible small movement changes, which utilizes a spatiotemporal movement-attending module (STMA) to capture long-distance spatial relations for facial expressions and weigh temporal segments. In addition, a deviation enhancement loss (DE-Loss) is embedded in the SMA-STN to enhance its robustness to subtle movement changes at the feature level. Extensive experiments on three widely used benchmarks, i.e., CASME II, SAMM, and SMIC, show that the proposed SMA-STN achieves better MER performance than other state-of-the-art methods, which indicates that the proposed method is effective in handling the challenging MER problem.
CV Sep 21, 2020
A Novel Transferability Attention Neural Network Model for EEG Emotion Recognition
Yang Li, Boxun Fu, Fu Li et al.
Existing methods for electroencephalograph (EEG) emotion recognition always train their models on all EEG samples indiscriminately. However, some of the source (training) samples may have a negative influence because they are significantly dissimilar to the target (test) samples. It is therefore necessary to give more attention to the EEG samples with strong transferability rather than forcefully training a classification model on all the samples. Furthermore, from the perspective of neuroscience, not all the brain regions of an EEG sample contain emotional information that can be transferred to the test data effectively. Data from some brain regions can even have a strong negative effect on learning the emotion classification model. Considering these two issues, in this paper, we propose a transferable attention neural network (TANN) for EEG emotion recognition, which learns emotionally discriminative information by adaptively highlighting the transferable EEG brain-region data and samples through local and global attention mechanisms. This is implemented by measuring the outputs of multiple brain-region-level discriminators and a single sample-level discriminator. We conduct extensive experiments on three public EEG emotion datasets. The results validate that the proposed model achieves state-of-the-art performance.
CV Aug 13, 2020
DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild
Xingxun Jiang, Yuan Zong, Wenming Zheng et al.
Recently, facial expression recognition (FER) in the wild has gained a lot of researchers' attention because it is a valuable topic to enable the FER techniques to move from the laboratory to the real applications. In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. First, we present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW (Dynamic Facial Expression in the Wild), consisting of over 16,000 video clips from thousands of movies. These video clips contain various challenging interferences in practical scenarios such as extreme illumination, occlusions, and capricious pose changes. Second, we propose a novel method called Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild. Third, we conduct extensive benchmark experiments on DFEW using a lot of spatiotemporal deep feature learning methods as well as our proposed EC-STFL. Experimental results show that DFEW is a well-designed and challenging database, and the proposed EC-STFL can promisingly improve the performance of existing spatiotemporal deep neural networks in coping with the problem of dynamic FER in the wild. Our DFEW database is publicly available and can be freely downloaded from https://dfew-dataset.github.io/.
CV Dec 19, 2018
Cross-Database Micro-Expression Recognition: A Benchmark
Yuan Zong, Tong Zhang, Wenming Zheng et al.
Cross-database micro-expression recognition (CDMER) is one of the recently emerging and interesting problems in micro-expression analysis. CDMER is more challenging than conventional micro-expression recognition (MER), because the training and testing samples in CDMER come from different micro-expression databases, resulting in inconsistent feature distributions between the training and testing sets. In this paper, we contribute to this topic from three aspects. First, we establish a CDMER experimental evaluation protocol that allows researchers to conveniently work on this topic and provides a standard platform for evaluating their proposed methods. Second, we conduct benchmark experiments using NINE state-of-the-art domain adaptation (DA) methods and SIX popular spatiotemporal descriptors to investigate the CDMER problem from two different perspectives. Third, we propose a novel DA method called region selective transfer regression (RSTR) to deal with the CDMER task. Our RSTR takes advantage of one important cue for recognizing micro-expressions, i.e., the different contributions of the facial local regions in MER. The overall superior performance of RSTR demonstrates that taking into consideration the important cues benefiting MER, e.g., the facial local region information, contributes to developing effective DA methods for dealing with the CDMER problem.
CV Nov 30, 2018
Cross-database non-frontal facial expression recognition based on transductive deep transfer learning
Keyu Yan, Wenming Zheng, Tong Zhang et al.
Cross-database non-frontal expression recognition is a very meaningful but rather difficult subject in the fields of computer vision and affective computing. In this paper, we propose a novel transductive deep transfer learning architecture based on the widely used VGGface16-Net for this problem. In this framework, the VGGface16-Net is used to jointly learn common optimal nonlinear discriminative features from the non-frontal facial expression samples of the source and target databases, and we then design a novel transductive transfer layer to deal with the cross-database non-frontal facial expression classification task. In order to validate the performance of the proposed transductive deep transfer learning network, we present extensive cross-database experiments on two well-known available facial expression databases, namely the BU-3DFE and the Multi-PIE databases. The final experimental results show that our transductive deep transfer network outperforms the state-of-the-art cross-database facial expression recognition methods.
CV Sep 11, 2018
Context-Dependent Diffusion Network for Visual Relationship Detection
Zhen Cui, Chunyan Xu, Wenming Zheng et al.
Visual relationship detection can bridge the gap between computer vision and natural language for scene understanding of images. Different from pure object recognition tasks, the relation triplets of subject-predicate-object lie on an extreme diversity space, such as person-behind-person and car-behind-building, while suffering from the problem of combinatorial explosion. In this paper, we propose a context-dependent diffusion network (CDDN) framework to deal with visual relationship detection. To capture the interactions of different object instances, two types of graphs, word semantic graph and visual scene graph, are constructed to encode global context interdependency. The semantic graph is built through language priors to model semantic correlations across objects, whilst the visual scene graph defines the connections of scene objects so as to utilize the surrounding scene information. For the graph-structured data, we design a diffusion network to adaptively aggregate information from contexts, which can effectively learn latent representations of visual relationships and well cater to visual relationship detection in view of its isomorphic invariance to graphs. Experiments on two widely-used datasets demonstrate that our proposed method is more effective and achieves the state-of-the-art performance.
SI Apr 16, 2018
Walk-Steered Convolution for Graph Classification
Jiatao Jiang, Chunyan Xu, Zhen Cui et al.
Graph classification is a fundamental but challenging issue for numerous real-world applications. Despite recent great progress in image/video classification, convolutional neural networks (CNNs) cannot yet cater to graphs well because of graphical non-Euclidean topology. In this work, we propose a walk-steered convolutional (WSC) network to assemble the essential success of standard convolutional neural networks as well as the powerful representation ability of random walk. Instead of deterministic neighbor searching used in previous graphical CNNs, we construct multi-scale walk fields (a.k.a. local receptive fields) with random walk paths to depict subgraph structures and advocate graph scalability. To express the internal variations of a walk field, Gaussian mixture models are introduced to encode principal components of walk paths therein. As an analogy to a standard convolution kernel on image, Gaussian models implicitly coordinate those unordered vertices/nodes and edges in a local receptive field after projecting to the gradient space of Gaussian parameters. We further stack graph coarsening upon Gaussian encoding by using dynamic clustering, such that high-level semantics of graph can be well learned like the conventional pooling on image. The experimental results on several public datasets demonstrate the superiority of our proposed WSC method over many state-of-the-arts for graph classification.
CV Mar 27, 2018
Tensor graph convolutional neural network
Tong Zhang, Wenming Zheng, Zhen Cui et al.
In this paper, we propose a novel tensor graph convolutional neural network (TGCNN) to conduct convolution on factorizable graphs, focusing on two types of problems: sequential dynamic graphs and cross-attribute graphs. In particular, we propose a graph preserving layer to memorize salient nodes of those factorized subgraphs, i.e., cross graph convolution and graph pooling. For cross graph convolution, a parameterized Kronecker sum operation is proposed to generate a conjunctive adjacency matrix characterizing the relationship between every pair of nodes across two subgraphs. With this operation, general graph convolution can be performed efficiently through the composition of small matrices, which reduces the high memory and computational burden. By encapsulating sequence graphs into a recursive learning scheme, the dynamics of graphs can be efficiently encoded, as well as the spatial layout of graphs. To validate the proposed TGCNN, experiments are conducted on skeleton action datasets as well as a matrix completion dataset. The experimental results demonstrate that our method achieves competitive performance compared with the state-of-the-art methods.
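For orientation, the plain (unparameterized) Kronecker sum of two adjacency matrices can be sketched as below. The paper's operation is parameterized, so this is only the textbook definition $A \oplus B = A \otimes I_m + I_n \otimes B$, and the function name `kronecker_sum` and the two toy graphs are assumptions for illustration.

```python
import numpy as np


def kronecker_sum(a, b):
    """Kronecker sum of square matrices a (n x n) and b (m x m).

    The result is (n*m) x (n*m). For adjacency matrices it is the adjacency
    matrix of the Cartesian product of the two graphs.
    """
    n, m = a.shape[0], b.shape[0]
    return np.kron(a, np.eye(m)) + np.kron(np.eye(n), b)


# Toy subgraphs: a 3-node path and a single edge between 2 nodes.
path3 = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
edge2 = np.array([[0, 1],
                  [1, 0]], dtype=float)

conjunctive = kronecker_sum(path3, edge2)
```

The two small factors hold 9 + 4 entries while the conjunctive matrix holds 36, which hints at why working with the factors instead of the full matrix can save memory as the subgraphs grow.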
CV Feb 27, 2018
Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition
Chaolong Li, Zhen Cui, Wenming Zheng et al.
Variations of human body skeletons may be considered as dynamic graphs, which are generic data representation for numerous real-world applications. In this paper, we propose a spatio-temporal graph convolution (STGC) approach for assembling the successes of local convolutional filtering and sequence learning ability of autoregressive moving average. To encode dynamic graphs, the constructed multi-scale local graph convolution filters, consisting of matrices of local receptive fields and signal mappings, are recursively performed on structured graph data of temporal and spatial domain. The proposed model is generic and principled as it can be generalized into other dynamic models. We theoretically prove the stability of STGC and provide an upper-bound of the signal transformation to be learnt. Further, the proposed recursive model can be stacked into a multi-layer architecture. To evaluate our model, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D. The experimental results demonstrate the effectiveness of our proposed model and the improvement over the state-of-the-art.
CV Nov 17, 2017
Action-Attending Graphic Neural Network
Chaolong Li, Zhen Cui, Wenming Zheng et al.
The motion analysis of human skeletons is crucial for human action recognition, which is one of the most active topics in computer vision. In this paper, we propose a fully end-to-end action-attending graphic neural network (A$^2$GNN) for skeleton-based action recognition, in which each irregular skeleton is structured as an undirected attribute graph. To extract high-level semantic representation from skeletons, we perform the local spectral graph filtering on the constructed attribute graphs like the standard image convolution operation. Considering not all joints are informative for action analysis, we design an action-attending layer to detect those salient action units (AUs) by adaptively weighting skeletal joints. Herein the filtering responses are parameterized into a weighting function irrelevant to the order of input nodes. To further encode continuous motion variations, the deep features learnt from skeletal graphs are gathered along consecutive temporal slices and then fed into a recurrent gated network. Finally, the spectral graph filtering, action-attending and recurrent temporal encoding are integrated together to jointly train for the sake of robust action recognition as well as the intelligibility of human actions. To evaluate our A$^2$GNN, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D dataset. The experimental results demonstrate that our network achieves the state-of-the-art performances.
CV Jul 26, 2017
Learning a Target Sample Re-Generator for Cross-Database Micro-Expression Recognition
Yuan Zong, Xiaohua Huang, Wenming Zheng et al.
In this paper, we investigate the cross-database micro-expression recognition problem, where the training and testing samples are from two different micro-expression databases. Under this setting, the training and testing samples would have different feature distributions and hence the performance of most existing micro-expression recognition methods may decrease greatly. To solve this problem, we propose a simple yet effective method called Target Sample Re-Generator (TSRG) in this paper. By using TSRG, we are able to re-generate the samples from target micro-expression database and the re-generated target samples would share same or similar feature distributions with the original source samples. For this reason, we can then use the classifier learned based on the labeled source samples to accurately predict the micro-expression categories of the unlabeled target samples. To evaluate the performance of the proposed TSRG method, extensive cross-database micro-expression recognition experiments designed based on SMIC and CASME II databases are conducted. Compared with recent state-of-the-art cross-database emotion recognition methods, the proposed TSRG achieves more promising results.
CV May 30, 2017
Deep manifold-to-manifold transforming network for action recognition
Tong Zhang, Wenming Zheng, Zhen Cui et al.
Symmetric positive definite (SPD) matrices (e.g., covariances, graph Laplacians, etc.) are widely used to model the relationship of spatial or temporal domain. Nevertheless, SPD matrices are theoretically embedded on Riemannian manifolds. In this paper, we propose an end-to-end deep manifold-to-manifold transforming network (DMT-Net) which can make SPD matrices flow from one Riemannian manifold to another more discriminative one. To learn discriminative SPD features characterizing both spatial and temporal dependencies, we specifically develop three novel layers on manifolds: (i) the local SPD convolutional layer, (ii) the non-linear SPD activation layer, and (iii) the Riemannian-preserved recursive layer. The SPD property is preserved through all layers without any requirement of singular value decomposition (SVD), which is often used in the existing methods with expensive computation cost. Furthermore, a diagonalizing SPD layer is designed to efficiently calculate the final metric for the classification task. To evaluate our proposed method, we conduct extensive experiments on the task of action recognition, where input signals are popularly modeled as SPD matrices. The experimental results demonstrate that our DMT-Net is much more competitive over state-of-the-art.
CV May 12, 2017
Spatial-Temporal Recurrent Neural Network for Emotion Recognition
Tong Zhang, Wenming Zheng, Zhen Cui et al.
Emotion analysis is a crucial problem to endow artifact machines with real intelligence in many large potential applications. As external appearances of human emotions, electroencephalogram (EEG) signals and video face signals are widely used to track and analyze human affective information. According to their common characteristics of spatial-temporal volumes, in this paper we propose a novel deep learning framework named spatial-temporal recurrent neural network (STRNN) to unify the learning of two different signal sources into a spatial-temporal dependency model. In STRNN, to capture the spatially co-occurrent variations of human emotions, a multi-directional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the spatial region of each time slice from multiple angles. Then a bi-directional temporal RNN layer is further used to learn discriminative temporal dependencies from the sequences concatenating the spatial features of each time slice produced by the spatial RNN layer. To further select the salient regions of emotion representation, we impose sparse projection onto the hidden states of the spatial and temporal domains, which also increases the discriminative ability of the model because of this global consideration. Consequently, such a two-layer RNN model builds spatial dependencies as well as temporal dependencies of the input signals. Experimental results on public emotion datasets of EEG and facial expression demonstrate that the proposed STRNN method is more competitive than state-of-the-art methods.
CV Jul 24, 2016
Recurrent Regression for Face Recognition
Yang Li, Wenming Zheng, Zhen Cui
To address the sequential changes of images including poses, in this paper we propose a recurrent regression neural network (RRNN) framework to unify two classic tasks: cross-pose face recognition on still images and video-based face recognition. To imitate the changes of images, we explicitly construct the potential dependencies of sequential images so as to regularize the final learning model. By performing progressive transforms for sequentially adjacent images, RRNN can adaptively memorize and forget information to benefit the final classification. For face recognition on still images, given any one image with any one pose, we recurrently predict the images of its sequential poses in order to capture useful information from other poses. For video-based face recognition, the recurrent regression takes one entire sequence rather than one image as its input. We verify RRNN on the static face dataset MultiPIE and the face video dataset YouTube Celebrities (YTC). The comprehensive experimental results demonstrate the effectiveness of the proposed RRNN method.