Xilong Lu

CV
h-index: 5
3 papers
9 citations
Novelty: 48%
AI Score: 27

3 Papers

CV, Mar 14, 2025
Compound Expression Recognition via Large Vision-Language Models

Jun Yu, Xilong Lu

Compound Expression Recognition (CER) is crucial for understanding human emotions and improving human-computer interaction. However, CER faces challenges due to the complexity of facial expressions and the difficulty of capturing subtle emotional cues. To address these issues, we propose a novel approach leveraging Large Vision-Language Models (LVLMs). Our method employs a two-stage fine-tuning process: first, pre-trained LVLMs are fine-tuned on basic facial expressions to establish foundational patterns; second, the model is further optimized on a compound-expression dataset to refine visual-language feature interactions. Our approach achieves advanced accuracy on the RAF-DB dataset and demonstrates strong zero-shot generalization on the C-EXPR-DB dataset, showcasing its potential for real-world applications in emotion analysis and human-computer interaction.

CV, Mar 14, 2025
Solution for 8th Competition on Affective & Behavior Analysis in-the-wild

Jun Yu, Yunxiang Zhang, Xilong Lu et al.

In this report, we present our solution for the Action Unit (AU) Detection Challenge in the 8th Competition on Affective Behavior Analysis in-the-wild. To achieve robust and accurate classification of facial action units in in-the-wild environments, we introduce an innovative method that leverages audio-visual multimodal data. Our method employs ConvNeXt as the image encoder and uses Whisper to extract Mel spectrogram features. A Transformer encoder-based feature fusion module then integrates the affective information embedded in the audio and image features. This provides rich high-dimensional feature representations for the subsequent multilayer perceptron (MLP) trained on the Aff-Wild2 dataset, enhancing the accuracy of AU detection.
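The fusion step described in this abstract can be illustrated with a minimal sketch of the attention operation at the core of a Transformer encoder. This is not the authors' implementation: the abstract does not say how the two token streams are combined, so concatenating them along the token axis is an assumption, and the function names `self_attention` and `fuse` are hypothetical. Learned projections, multiple heads, feed-forward layers, and the MLP head are all omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real encoder learns these projections.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def fuse(image_tokens: List[Vector], audio_tokens: List[Vector]) -> Vector:
    """Hypothetical fusion: concatenate modalities, attend, then mean-pool."""
    tokens = image_tokens + audio_tokens  # every token can attend across modalities
    attended = self_attention(tokens)
    dim = len(attended[0])
    return [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]


if __name__ == "__main__":
    image = [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5]]  # toy image-encoder features
    audio = [[0.9, 0.7, 0.1]]                   # toy audio-encoder features
    print(fuse(image, audio))
```

Because every token attends over the concatenated sequence, each image token can draw on audio context and vice versa. The pooled vector stands in for the representation that a downstream classifier would consume.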

CV, Mar 13, 2025
Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition

Jun Yu, Lingsi Zhu, Yanjun Chi et al.

Emotional Mimicry Intensity (EMI) estimation plays a pivotal role in understanding human social behavior and advancing human-computer interaction. The core challenges lie in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods--insufficient exploitation of cross-modal synergies, sensitivity to noise, and constrained fine-grained alignment capabilities--this paper proposes a dual-stage cross-modal alignment framework. Stage 1 develops vision-text and audio-text contrastive learning networks based on a CLIP architecture, achieving preliminary feature-space alignment through modality-decoupled pre-training. Stage 2 introduces a temporal-aware dynamic fusion module integrating Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to capture macro-evolution patterns of facial expressions and local dynamics of acoustic features, respectively. A novel quality-guided fusion strategy further enables differentiable weight allocation for modality compensation under occlusion and noise. Experiments on the Hume-Vidmimic2 dataset demonstrate superior performance, with an average Pearson correlation coefficient of 0.51 across six emotion dimensions on the validation set. Remarkably, our method achieved 0.68 on the test set, securing the runner-up position in the EMI Challenge Track of the 8th ABAW (Affective Behavior Analysis in the Wild) Competition and offering a novel pathway for fine-grained emotion analysis in open environments.
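The metric reported in this abstract, a Pearson correlation coefficient averaged across emotion dimensions, can be sketched as follows. This is an illustrative computation, not the challenge's official scoring code: the function names `pearson` and `mean_pearson` are hypothetical, and the toy values are made up for demonstration only.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(preds: List[Sequence[float]], targets: List[Sequence[float]]) -> float:
    """Average the per-dimension Pearson coefficient.

    preds and targets hold one row per sample; each row holds one
    intensity value per emotion dimension (six in the abstract).
    """
    n_dims = len(targets[0])
    per_dim = [
        pearson([row[d] for row in preds], [row[d] for row in targets])
        for d in range(n_dims)
    ]
    return sum(per_dim) / n_dims


if __name__ == "__main__":
    # Toy data: 4 samples x 2 dimensions (made-up values).
    predictions = [[0.1, 0.8], [0.4, 0.6], [0.5, 0.3], [0.9, 0.2]]
    labels = [[0.0, 0.9], [0.3, 0.5], [0.6, 0.4], [1.0, 0.1]]
    print(mean_pearson(predictions, labels))
```

Averaging per dimension, rather than correlating the flattened arrays, keeps a dimension with a wide value range from dominating the score.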