CV | Sep 24, 2025

Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

arXiv:2509.19659v1 | 3 citations | h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses fairness risks in multimodal AI that affect users of VLMs, though the contribution is incremental because it builds on existing bias-assessment methods.

The researchers tackled the problem of harmful social stereotypes in vision-language models by creating a benchmark of 1,343 news image-question pairs and evaluating state-of-the-art VLMs on it. They found that visual context systematically shifts model outputs and that bias risk is particularly high for gender and occupation.

Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
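As a rough illustration of how per-attribute bias prevalence and per-model faithfulness could be tabulated from judge verdicts, here is a minimal sketch. It assumes each judged response is reduced to two boolean flags; the record fields (`model`, `attribute`, `biased`, `faithful`), the model names, and the helper `rate_by` are hypothetical and are not taken from the released benchmark or rubric.

```python
from collections import defaultdict

# Hypothetical judge outputs: one record per (model, image-question pair).
# Field names and values are illustrative, not the benchmark's actual schema.
records = [
    {"model": "vlm_a", "attribute": "gender", "biased": True, "faithful": True},
    {"model": "vlm_a", "attribute": "gender", "biased": False, "faithful": True},
    {"model": "vlm_a", "attribute": "occupation", "biased": True, "faithful": False},
    {"model": "vlm_b", "attribute": "gender", "biased": False, "faithful": False},
    {"model": "vlm_b", "attribute": "age", "biased": False, "faithful": True},
]


def rate_by(records, key_fields, flag):
    """Return the fraction of records with `flag` set, grouped by `key_fields`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        totals[key] += 1
        hits[key] += int(rec[flag])
    return {key: hits[key] / totals[key] for key in totals}


# Bias prevalence per (model, attribute), plus per-model bias and faithfulness.
bias_by_attribute = rate_by(records, ["model", "attribute"], "biased")
bias_by_model = rate_by(records, ["model"], "biased")
faithfulness_by_model = rate_by(records, ["model"], "faithful")

for key, rate in sorted(bias_by_attribute.items()):
    print(key, round(rate, 2))
```

Reporting the two rates side by side per model is one simple way to check a claim like finding (iii): in this toy data, `vlm_a` is more faithful than `vlm_b` and also has the higher bias rate.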
