CV AI HC · Jun 23, 2025

Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

Iosif Tsangko, Andreas Triantafyllopoulos, Adem Abdelmoula, Adria Mallol-Ragolta, Bjoern W. Schuller

arXiv:2506.19079v1 · 3.63 citations · h-index: 14 · IEEE Access

Originality: Synthesis-oriented

AI Analysis

This work addresses potential biases in foundation models for sensitive applications like mental health and education, but it is incremental as it builds on existing benchmarks and introspection methods.

The paper investigates the visual cues used by Vision Language Models for facial emotion recognition, finding that performance shifts depend on visible teeth and that attributes like eyebrow position drive predictions, revealing risks of shortcut learning and bias.

Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.
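The teeth-based comparison described in the abstract can be pictured as a per-group accuracy split. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record layout, the group names `teeth` and `no_teeth`, the function `accuracy_by_group`, and the toy labels are all hypothetical and carry no results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy records for illustration only (not data from the paper).
records = [
    ("teeth", "happy", "happy"),
    ("teeth", "happy", "happy"),
    ("teeth", "fear", "happy"),
    ("teeth", "surprise", "surprise"),
    ("no_teeth", "happy", "neutral"),
    ("no_teeth", "sad", "sad"),
    ("no_teeth", "neutral", "neutral"),
    ("no_teeth", "anger", "sad"),
]

acc = accuracy_by_group(records)
# A nonzero gap between the groups is the kind of shift that would hint
# at visible teeth acting as a proxy cue.
gap = acc["teeth"] - acc["no_teeth"]
print(acc, gap)
```

A real analysis would also break the gap down per emotion class, since a shortcut tied to smiles may help some classes and hurt others.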
