MMSep 12, 2024
Early Joint Learning of Emotion Information Makes MultiModal Model Understand You BetterMengying Ge, Mingyang Li, Dongkai Tang et al.
In this paper, we present our solutions for emotion recognition in the sub-challenges of Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, where joint training of audio and text is conducted initially. And the joint Audio-Text modal feature will be late-fused with other unimodal features. In order to solve the problems of data insufficiency and class imbalance, We use multiple turns of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we employ speech source separation to preprocess audios. Our model ranks \textbf{2nd} in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
CVAug 21, 2024
Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language ModelMengying Ge, Dongkai Tang, Mingyang Li
Multimodal emotion recognition is a task of great concern. However, traditional data sets are based on fixed labels, resulting in models that often focus on main emotions and ignore detailed emotional changes in complex scenes. This report introduces the solution of using MLLMs technology to generate open-vocabulary emotion labels from a video. The solution includes the use of framework, data generation and processing, training methods, results generation and multi-model co-judgment. In the MER-OV (Open-Word Emotion Recognition) of the MER2024 challenge, our method achieved significant advantages, leading to its superior capabilities in complex emotion computation.
CVJan 5
AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-OffYihan Zhu, Mengying Ge
Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggles to preserve structured patterns and fine-grained details, leading to texture attenuation during generation.To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design, consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising process.Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.
CVJan 4, 2020
DAF-NET: a saliency based weakly supervised method of dual attention fusion for fine-grained image classificationZiChao Dong, JiLong Wu, TingTing Ren et al.
Fine-grained image classification is a challenging problem, since the difficulty of finding discriminative features. To handle this circumstance, basically, there are two ways to go. One is use attention based method to focus on informative areas, while the other one aims to find high order between features. Further, for attention based method there are two directions, activation based and detection based, which are proved effective by scholars. However ,rare work focus on fusing two types of attention with high order feature. In this paper, we propose a novel DAF method which fuse two types of attention and use them to as PAF(part attention filter) in deep bilinear transformation module to mine the relationship between separate parts of an object. Briefly, our network constructed by a student net who attempt to output two attention maps and a teacher net uses these two maps as empirical information to refine the result. The experiment result shows that only student net could get 87.6% accuracy in CUB dataset while cooperating with teacher net could achieve 89.1% accuracy.