CV LG Oct 26, 2025

FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment

Zahraa Al Sahili, Maryam Fetanat, Maimuna Nowaz, Ioannis Patras, Matthew Purver

arXiv:2510.22827v2, 3 citations, h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable, reproducible fairness audits of text-to-image systems, though the advance is incremental because it builds on existing evaluation methods.

The paper tackles the evaluation of text-to-image systems for prompt alignment and treatment of social attributes. It introduces FairJudge, a lightweight protocol that uses multimodal LLMs as judges. Across datasets, the judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy.

Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastive similarity -- reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
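The judging constraints described in the abstract (a rubric mapped to [-1, 1], a closed label set, required evidence, and mandatory abstention) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the 1-to-5 rubric scale, the `unknown` abstention token, the field names, and both function names are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical abstention token; the paper's actual token is not given in the abstract.
ABSTAIN = "unknown"


@dataclass
class Judgment:
    label: str                      # a member of the closed label set, or ABSTAIN
    evidence: str                   # judge's description of the visible cues it relied on
    alignment: Optional[float] = None  # alignment score in [-1, 1], if a rubric score was given


def rubric_to_alignment(score: int, lo: int = 1, hi: int = 5) -> float:
    """Linearly map a rubric score in [lo, hi] to [-1, 1].

    The 1-to-5 scale is an assumption; the abstract only states the target range.
    """
    if not lo <= score <= hi:
        raise ValueError(f"rubric score {score} outside [{lo}, {hi}]")
    return 2.0 * (score - lo) / (hi - lo) - 1.0


def validate_judgment(raw: dict, allowed_labels: set) -> Judgment:
    """Enforce the closed label set and the evidence requirement.

    Any label outside the allowed set, or any answer without evidence,
    is converted to an abstention (an assumed policy for this sketch).
    """
    label = str(raw.get("label", "")).strip().lower()
    evidence = str(raw.get("evidence", "")).strip()
    if label not in allowed_labels or not evidence:
        return Judgment(label=ABSTAIN, evidence=evidence)
    score = raw.get("rubric_score")
    alignment = rubric_to_alignment(score) if score is not None else None
    return Judgment(label=label, evidence=evidence, alignment=alignment)
```

Under these assumptions, a judge response such as `{"label": "Female", "evidence": "long hair, dress", "rubric_score": 4}` would pass validation against `{"female", "male"}` with an alignment of 0.5, while a response with an out-of-set label or empty evidence would be recorded as an abstention rather than a guess.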
