CV May 14, 2025

Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition

arXiv:2505.09336v16.21 citations h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses facial emotion recognition for applications such as human-computer interaction, but the advance is incremental because it builds on existing vision-language and contrastive learning approaches.

The paper tackles unsupervised representation learning for 3D/4D facial expression recognition by introducing MultiviewVLM, which integrates pseudo-labeled prompts and contrastive learning to align multiview representations. The authors report that it outperforms state-of-the-art methods.

In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated textual prompts to guide implicit alignment of emotional semantics. To capture information shared across multiple views, we propose a joint embedding space that aligns multiview representations without requiring explicit supervision. We further enhance the discriminability of our model through a novel multiview contrastive learning strategy that leverages stable positive-negative pair sampling. A gradient-friendly loss function is introduced to promote smoother and more stable convergence, and the model is optimized for distributed training to ensure scalability. Extensive experiments demonstrate that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.
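To make the idea of aligning multiview representations in a joint embedding space more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss averaged over pairs of views. It is not the paper's loss: the abstract does not specify the "gradient-friendly" loss or the pair-sampling scheme, so the function names (`info_nce`, `multiview_contrastive_loss`), the temperature value, and the choice of treating same-index embeddings as positives are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE loss between two views of the same batch.

    Assumption: view_a[i] and view_b[i] embed the same sample (positive pair);
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Cosine similarities scaled by the temperature, in both directions.
        logits_ab = [dot(a[i], b[j]) / temperature for j in range(n)]
        logits_ba = [dot(b[i], a[j]) / temperature for j in range(n)]
        for logits in (logits_ab, logits_ba):
            # Numerically stable log-sum-exp for the softmax denominator.
            m = max(logits)
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denom - logits[i]  # cross-entropy with target index i
    return total / (2 * n)


def multiview_contrastive_loss(views, temperature=0.1):
    """Average the pairwise loss over all unordered pairs of views."""
    pairs = [(i, j) for i in range(len(views)) for j in range(i + 1, len(views))]
    return sum(info_nce(views[i], views[j], temperature) for i, j in pairs) / len(pairs)


# Toy data: three views of two samples; each view is a list of 2-D embeddings.
aligned = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 0.0], [0.0, 1.0]],
           [[2.0, 0.0], [0.0, 3.0]]]
# Same data, but the second view has its two samples swapped.
misaligned = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]],
              [[2.0, 0.0], [0.0, 3.0]]]

loss_aligned = multiview_contrastive_loss(aligned)
loss_misaligned = multiview_contrastive_loss(misaligned)
```

In this toy setup the loss is close to zero when the views agree on every sample and much larger when one view is shuffled, which is the behavior a contrastive objective relies on to pull matching views together in the shared space.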
