AIFeb 28

Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

Yiran Zhao, Lu Zhou, Xiaogang Xu, Zhe Liu, Jiafei Wu, Liming Fang
arXiv:2603.00590v11 citations
Originality Incremental advance
AI Analysis

This work addresses the problem of conflicting fairness metrics in AI for researchers and developers, offering a novel benchmark to guide fairness optimization in UMLLMs, though it is incremental in providing a new evaluation framework rather than a fundamental solution.

The authors tackled the challenge of evaluating fairness in unified Multimodal Large Language Models (UMLLMs) by introducing the IRIS Benchmark, which synchronously assesses understanding and generation tasks, uncovering systemic issues like a 'generation gap' and 'personality splits' in leading models.

As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a "Tower of Babel'' dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms-particularly in unified Multimodal Large Language Models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. Enabled by our demographic classifier, ARES, and four supporting large-scale datasets, the benchmark is designed to normalize and aggregate arbitrary metrics into a high-dimensional "fairness space'', integrating 60 granular metrics across three dimensions-Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the "generation gap'', individual inconsistencies like "personality splits'', and the "counter-stereotype reward'', while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark is capable of integrating evolving fairness metrics, ultimately helping to resolve the "Tower of Babel'' impasse. Project Page: https://iris-benchmark-web.vercel.app/

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes