CV, AI | Jun 26, 2025

Multimodal Prompt Alignment for Facial Expression Recognition

arXiv:2506.21017v1 | 10.23 citations | h-index: 22

Originality: Incremental advance

AI Analysis

This work aims to improve facial expression recognition accuracy for applications such as human-computer interaction. The contribution is incremental, since it builds on existing prompt learning methods.

The paper addresses the difficulty of capturing fine-grained textual-visual relationships in facial expression recognition (FER) with vision-language models. The authors report that the proposed MPA-FER framework outperforms state-of-the-art methods on three benchmark datasets while preserving the pretrained model's generalization and keeping computational costs low.

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
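The two alignment ideas in the abstract (pulling soft-prompt features toward LLM-derived hard-prompt features, and pulling prompted visual features toward class prototypes from the frozen encoder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the discrepancy measure, so cosine distance is assumed here, and every function name, array shape, and the choice of 7 expression classes is hypothetical. Random vectors stand in for CLIP text and image features.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prompt_alignment_loss(soft_feats, hard_feats):
    """Mean cosine distance between per-class soft- and hard-prompt features.

    Assumption: cosine distance stands in for the unspecified
    "feature discrepancy" between soft and hard prompts.
    Both inputs have shape (num_classes, dim).
    """
    s = l2_normalize(soft_feats)
    h = l2_normalize(hard_feats)
    return float(np.mean(1.0 - np.sum(s * h, axis=-1)))


def class_prototypes(frozen_feats, labels, num_classes):
    """Average frozen-encoder features per class into one prototype each.

    Assumption: a prototype is the class mean; every class must have
    at least one sample in `labels`.
    """
    protos = np.stack(
        [frozen_feats[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(protos)


def prototype_alignment_loss(prompted_feats, labels, prototypes):
    """Mean cosine distance between each prompted feature and its class prototype."""
    p = l2_normalize(prompted_feats)
    return float(np.mean(1.0 - np.sum(p * prototypes[labels], axis=-1)))


# Toy data standing in for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
num_classes, dim = 7, 16
soft = rng.normal(size=(num_classes, dim))    # learnable soft-prompt features
hard = rng.normal(size=(num_classes, dim))    # LLM-description features
labels = np.arange(14) % num_classes          # two samples per class
frozen = rng.normal(size=(14, dim))           # frozen image-encoder features
prompted = frozen + 0.1 * rng.normal(size=(14, dim))  # prompted features

protos = class_prototypes(frozen, labels, num_classes)
text_loss = prompt_alignment_loss(soft, hard)
visual_loss = prototype_alignment_loss(prompted, labels, protos)
print(text_loss, visual_loss)
```

In a training loop, terms like these would typically be added to the classification loss with weighting coefficients; the weights and the cross-modal global-local alignment module are not described in enough detail in the abstract to sketch here.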
