CV | AI | CL | Aug 1, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

Microsoft
arXiv:2408.00765v20.5156 citations | h-index: 67 | Has Code
AI Analysis 15

This work addresses the need for more realistic evaluation benchmarks in multimodal AI research, though it is incremental as it builds on an existing benchmark.

The authors address a limitation of the original MM-Vet benchmark, whose questions are restricted to single image-text pairs, by introducing MM-Vet v2. This more challenging benchmark evaluates large multimodal models on integrated capabilities and adds a new one, image-text sequence understanding. On MM-Vet v2, Claude 3.5 Sonnet scored 71.8, slightly outperforming GPT-4o at 71.0.

MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding" that evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4. The code, data, and leaderboard are accessible at https://github.com/yuweihao/MM-Vet.
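
To make the idea of "integrated capabilities" concrete, the sketch below shows one way per-capability scores could be aggregated when each question exercises several capabilities at once. It is only an illustration: the abstract does not describe the scoring procedure, so the per-sample score in [0, 1], the `GradedSample` type, the `summarize` function, and the sample data are all hypothetical. Only the seven capability names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The seven capabilities named in the abstract:
# six from MM-Vet plus the new one added in MM-Vet v2.
CAPABILITIES = frozenset({
    "recognition",
    "knowledge",
    "spatial awareness",
    "language generation",
    "OCR",
    "math",
    "image-text sequence understanding",
})


@dataclass(frozen=True)
class GradedSample:
    """One graded question (hypothetical structure, not the benchmark's format)."""
    sample_id: str
    capabilities: frozenset  # capabilities this question exercises
    score: float             # assumed grader score in [0, 1]


def summarize(samples):
    """Return (overall, per_capability) mean scores on a 0-100 scale."""
    if not samples:
        raise ValueError("no samples to summarize")
    buckets = defaultdict(list)
    for s in samples:
        unknown = s.capabilities - CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {sorted(unknown)}")
        if not 0.0 <= s.score <= 1.0:
            raise ValueError(f"score out of range for {s.sample_id}")
        # A sample counts toward every capability it is tagged with.
        for cap in s.capabilities:
            buckets[cap].append(s.score)
    overall = 100.0 * sum(s.score for s in samples) / len(samples)
    per_capability = {
        cap: 100.0 * sum(scores) / len(scores) for cap, scores in buckets.items()
    }
    return overall, per_capability


# Made-up data for illustration only; these are not benchmark results.
samples = [
    GradedSample("q1", frozenset({"recognition", "OCR"}), 1.0),
    GradedSample("q2", frozenset({"OCR", "math"}), 0.5),
    GradedSample(
        "q3",
        frozenset({"image-text sequence understanding", "recognition"}),
        0.0,
    ),
]
overall, per_capability = summarize(samples)
print(f"overall: {overall:.1f}")
for cap in sorted(per_capability):
    print(f"{cap}: {per_capability[cap]:.1f}")
```

Because a single question can carry several capability tags, the per-capability averages overlap and do not sum to the overall score. That overlap is the sense in which a benchmark of this kind measures capabilities in combination rather than in isolation.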

Code Implementations: 1 repo
