CL AI CV · May 3, 2024

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht

Peking U

arXiv:2405.02287v1 · 16.434 citations · h-index: 41 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for rigorous evaluation of multimodal language models, particularly on challenging tasks. It is incremental in that it builds on existing benchmarking efforts.

The authors introduce Vibe-Eval, a new open benchmark of 269 visual understanding prompts, 100 of them hard, for evaluating multimodal chat models. They find that more than 50% of the hard questions are answered incorrectly by all frontier models, and that automatic evaluation with Reka Core roughly correlates with human judgment.

We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data; see https://github.com/reka-ai/reka-vibe-eval
