CV | Jun 23, 2023

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Tencent
arXiv:2306.13394v5 | 1567 citations | h-index: 60 | Has Code
Originality: Incremental advance
AI Analysis

This benchmark addresses the lack of standardized evaluation for MLLMs, benefiting researchers and developers in multimodal AI by enabling fair comparisons and guiding model enhancements.

The authors introduced MME, the first comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs), measuring perception and cognition across 14 subtasks with manually designed annotations to avoid data leakage, and evaluated 30 advanced MLLMs, revealing significant room for improvement and potential optimization directions.

Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks and have shown striking emergent abilities in recent studies, such as writing poems based on an image. However, such case studies cannot fully reflect MLLM performance, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, the annotations of the instruction-answer pairs are all manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering. Such instructions also make it easy to compute quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on MME, which not only suggests that existing MLLMs still have considerable room for improvement, but also reveals potential directions for subsequent model optimization. The data are released at the project page https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.
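
The abstract does not spell out how the "quantitative statistics" are computed. The following is a minimal sketch, assuming the scoring protocol described in the MME paper itself: every instruction is a yes/no question, each image carries two questions, and a subtask score is the sum of accuracy (per question) and accuracy+ (per image, both questions correct), each on a 0-100 scale. The function name `mme_subtask_score` and the `(image_id, ground_truth, prediction)` record layout are hypothetical, chosen only for illustration, and predictions are assumed to be already normalized to "yes" or "no".

```python
from collections import defaultdict


def mme_subtask_score(records):
    """Score one subtask from (image_id, ground_truth, prediction) triples.

    Assumption: ground_truth and prediction are 'yes'/'no' strings; turning a
    free-form model reply into 'yes' or 'no' is outside this sketch.
    Returns (accuracy, accuracy_plus, score), percentages on a 0-100 scale.
    """
    # Group per-question correctness by image.
    per_image = defaultdict(list)
    for image_id, truth, pred in records:
        per_image[image_id].append(truth.strip().lower() == pred.strip().lower())

    total_questions = sum(len(flags) for flags in per_image.values())
    if total_questions == 0:
        raise ValueError("no records to score")

    # accuracy: fraction of individual questions answered correctly.
    correct_questions = sum(sum(flags) for flags in per_image.values())
    accuracy = 100.0 * correct_questions / total_questions

    # accuracy+: fraction of images whose questions are all answered correctly.
    fully_correct_images = sum(all(flags) for flags in per_image.values())
    accuracy_plus = 100.0 * fully_correct_images / len(per_image)

    return accuracy, accuracy_plus, accuracy + accuracy_plus


if __name__ == "__main__":
    demo = [
        ("img_a", "yes", "yes"),
        ("img_a", "no", "no"),
        ("img_b", "yes", "yes"),
        ("img_b", "no", "yes"),  # wrong answer, so img_b does not count for accuracy+
    ]
    print(mme_subtask_score(demo))
```

Under these assumptions a model that answers "yes" to everything would reach about 50 accuracy but 0 accuracy+, which is why the stricter per-image metric is useful alongside plain accuracy.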

Code Implementations: 4 repos
