CL · CV · Jul 17, 2024

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

arXiv:2407.12772v2 · 301 citations · h-index: 23 · Has Code
AI Analysis

This work addresses the need for effective and reliable benchmarking for researchers and developers of large multimodal models, though it is incremental in refining existing evaluation approaches.

The paper tackles the evaluation of large multimodal models by introducing LMMS-EVAL, a unified benchmark framework with over 50 tasks and more than 10 models. Because the full framework still falls short on low cost and zero contamination, the authors add LMMS-EVAL LITE and Multimodal LIVEBENCH as practical solutions.

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain a leaderboard for LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.

Code Implementations: 1 repo
