AI LG MMOct 17, 2024

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You

DeepMind

arXiv:2410.13754v29.65 citationsh-index: 25

Originality Incremental advance

AI Analysis

This addresses the need for reliable and standardized evaluations in multi-modal AI development, offering a practical tool for researchers and practitioners.

The paper tackles the problem of inconsistent standards and biases in AI evaluations by introducing MixEval-X, a benchmark that standardizes any-to-any evaluations across modalities, achieving up to 0.98 correlation with real-world evaluations while being more efficient.

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with that of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

View on arXiv PDF

Similar