CL LGMay 20, 2025

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Zahraa Al Sahili, Ioannis Patras, Matthew Purver

arXiv:2505.14160v46.72 citationsh-index: 4IJCNLP-AACL

Originality Incremental advance

AI Analysis

This work highlights critical social bias issues in multilingual AI models, revealing that multilinguality can reinforce rather than mitigate disparities, which is important for developers and users aiming for fair and equitable AI systems.

The study systematically audits gender and racial disparities in four multilingual vision-language models, finding that all models exhibit stronger gender bias than English-only baselines, with low-resource languages showing the largest biases and cross-lingual weight sharing importing stereotypes into gender-neutral languages.

Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits -- and in caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

View on arXiv PDF

Similar