CL, CV | Mar 15, 2024

EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models

arXiv:2403.10378v1 | 51 citations | h-index: 47 | ACL
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for evaluating vision language models, addressing the need for diverse and complex multimodal reasoning tasks. The contribution is incremental in that it builds on existing evaluation frameworks.

The authors tackled the problem of evaluating vision language models by introducing EXAMS-V, a multi-discipline multilingual multimodal exam benchmark with 20,932 multiple-choice questions across 20 disciplines and 11 languages, which proved challenging for advanced models like GPT-4V and Gemini.

We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.
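Because the benchmark is made up of multiple-choice questions spread over languages and disciplines, results on it are naturally reported as accuracy per group. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The `Item` record, its field names, the `accuracy_by` helper, and the toy data are all hypothetical and do not reflect the actual dataset schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout; the real dataset schema may differ.
    language: str   # language of the exam question
    subject: str    # school discipline
    answer: str     # gold option letter, e.g. "A"


def accuracy_by(items, predictions, key):
    """Return accuracy per group, where the group is the item attribute `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = getattr(item, key)
        total[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if pred.strip().upper() == item.answer.strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy data, invented purely for illustration.
toy_items = [
    Item(language="lang_1", subject="physics", answer="A"),
    Item(language="lang_1", subject="chemistry", answer="B"),
    Item(language="lang_2", subject="physics", answer="C"),
]
toy_predictions = ["A", "C", "c "]

per_language = accuracy_by(toy_items, toy_predictions, "language")
per_subject = accuracy_by(toy_items, toy_predictions, "subject")
```

Grouping by an attribute name keeps the same function usable for per-language and per-discipline breakdowns. A real evaluation would also need to extract the option letter from free-form model output, which this sketch leaves out.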

Code Implementations: 1 repo
