CV AI CY | Sep 8, 2025

Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

arXiv:2509.07050v1 | 1 citation | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses harmful social biases in AI image generation, giving developers and users a comprehensive benchmark for promoting fairness. It is an incremental advance because it builds on prior bias-identification work.

The study tackled gender bias in 13 large multimodal models by evaluating AI-generated images from gender-neutral prompts. It found that the models systematically amplify occupational gender stereotypes, with overall male representation ranging from 46.7% to 73.3% across models.

Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally generated, gender-neutral prompts to generate people in stereotypically male, stereotypically female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs not only reproduce but amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.
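
The kind of measurement the abstract reports can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not name the statistical test used, so the exact binomial test against parity here is an assumption, and the function names, the label values (`"male"`, `"female"`), and the toy numbers are all hypothetical.

```python
from collections import Counter
from math import comb


def male_share(labels):
    """Fraction of images judged 'male' among those given a binary gender label.

    `labels` is a hypothetical list of judge outputs such as "male" or "female";
    any other value (e.g. "unclear") is ignored.
    """
    counts = Counter(labels)
    n = counts["male"] + counts["female"]
    if n == 0:
        raise ValueError("no gendered labels to score")
    return counts["male"] / n


def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null success rate p0 (assumed test, see lead-in).
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(p for p in pmf if p <= threshold))


def amplification(generated_share, baseline_share):
    """Generated male share minus a real-world baseline share.

    Positive values mean the model over-represents men relative to the baseline.
    """
    return generated_share - baseline_share


# Toy, made-up judge labels for one profession (not data from the paper).
toy_labels = ["male"] * 9 + ["female"] * 1
share = male_share(toy_labels)
p_value = binom_two_sided_p(toy_labels.count("male"), len(toy_labels))
gap = amplification(share, baseline_share=0.7)  # 0.7 is a made-up labor baseline
```

Aggregating `male_share` per profession category and per model would give figures comparable in form to the percentages quoted above, while `amplification` captures the comparison against labor data.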

