CL, AI, CV | Feb 24, 2024

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

arXiv:2402.15745v2 | 33 citations | h-index: 2 | ACL
Originality: Incremental advance
AI Analysis

This benchmark addresses the need for more rigorous evaluation of LVLMs in multilingual, high-stakes contexts, though it is an incremental advance that builds on existing benchmarking efforts.

The authors address the lack of comprehensive benchmarks for Large Vision-Language Models (LVLMs) by introducing GAOKAO-MM, a Chinese multimodal benchmark based on the Chinese College Entrance Examination. Even the best-performing model, GPT-4-Vision, achieved only 48.1% accuracy, indicating a significant gap from human-level performance.

Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. However, existing multimodal benchmarks focus on primary perception abilities and commonsense knowledge, which are insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from a native Chinese context and sets human-level requirements for the model's abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that the accuracies of all of them are lower than 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs remain a moderate distance from Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs.

Code Implementations: 1 repo