CV · Mar 24, 2025

SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition

Sixian Ding, Xu Jiang, Zhongjing Du, Jiaqi Cui, Xinyi Zeng, Yan Wang

arXiv:2503.18463v1 · h-index: 6 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of labeled data in facial expression recognition, a task that matters for human-computer interaction and emotion analysis. It is an incremental advance that combines multiple information sources.

The paper tackles unreliable pseudo-labels in semi-supervised facial expression recognition by integrating semantic-, instance-, and text-level information to generate higher-quality pseudo-labels. On three datasets, the authors report that the method outperforms state-of-the-art semi-supervised methods and exceeds fully supervised baselines.

Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty of obtaining sufficient labeled data in practical settings. However, existing SS-DFER methods mainly use generated semantic-level pseudo-labels for supervised learning, and the unreliability of these pseudo-labels compromises performance and undermines practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic-, instance-, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the text-level and instance-level probabilities, respectively. These are combined with the semantic-level probability, and the three levels of probabilities are aggregated to obtain the final pseudo-labels. Furthermore, to make better use of the one-hot labels of the labeled data, we also incorporate text embeddings extracted from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.
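
The three-level pseudo-labeling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the probabilities are aggregated, so the weighted average, the temperature of 0.1, the confidence threshold of 0.6, and all function names here (`similarity_probs`, `aggregate_pseudo_label`) are assumptions chosen for illustration only. The toy 2-D vectors stand in for real visual features, text embeddings, and instance representations.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_probs(feature, class_vectors, temperature=0.1):
    """Class probabilities from similarities between one visual feature and
    one reference vector per class (text embeddings or instance features).
    The temperature value is an assumption, not taken from the paper."""
    return softmax([cosine(feature, c) for c in class_vectors], temperature)

def aggregate_pseudo_label(semantic_probs, instance_probs, text_probs,
                           weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.6):
    """Fuse the three probability vectors with a weighted average (an assumed
    stand-in for the paper's aggregation) and keep the pseudo-label only if
    the fused confidence reaches the (assumed) threshold."""
    w_sem, w_ins, w_txt = weights
    fused = [w_sem * s + w_ins * i + w_txt * t
             for s, i, t in zip(semantic_probs, instance_probs, text_probs)]
    confidence = max(fused)
    label = fused.index(confidence)
    return label, fused, confidence >= threshold

# Toy example with two expression classes.
visual_feature = [1.0, 0.0]                    # feature of one unlabeled face
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]     # one text embedding per class
instance_features = [[0.9, 0.1], [0.1, 0.9]]   # one instance vector per class
semantic_probs = [0.7, 0.3]                    # classifier output for this face

text_probs = similarity_probs(visual_feature, text_embeddings)
instance_probs = similarity_probs(visual_feature, instance_features)
label, fused, keep = aggregate_pseudo_label(semantic_probs, instance_probs, text_probs)
print(label, keep)
```

In this toy setup all three levels agree on class 0, so the fused confidence is high and the pseudo-label is kept. When the levels disagree, the fused confidence drops and the sample can be filtered out, which is the intuition behind combining several sources instead of trusting the classifier output alone.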
