CL, AI, LG | Feb 22, 2024

GenCeption: Evaluate Vision LLMs with Unlabeled Unimodal Data

Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang

arXiv:2402.14973v42.73 citations | h-index: 15 | Has Code | Computer Speech and Language

Originality: Incremental advance

AI Analysis

The work addresses the need for cheaper, faster, and less contamination-prone evaluation methods for MLLMs, though the advance is incremental because it builds on existing evaluation concepts.

The paper tackles the problem of evaluating multimodal large language models (MLLMs) by proposing GenCeption, an annotation-free method that uses only unimodal data to measure semantic coherence and hallucination tendencies, with empirical results showing strong correlations to established benchmarks.

Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, is expected to result in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@T metric. While GenCeption is principally applicable to MLLMs across various modalities, this paper focuses on its implementation and validation for Vision LLMs (VLLMs). Based on the GenCeption method, we establish the MMECeption benchmark for evaluating VLLMs, and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still significantly lag behind human performance and struggle especially with text-intensive tasks.
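The iterative describe-then-generate procedure from the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The callables `describe`, `generate`, and `embed` are hypothetical stand-ins for a VLLM, an image generator, and an image encoder. The scoring rule (an iteration-weighted average of cosine similarity to the seed sample) is an assumption made for this sketch, since the abstract names the GC@T metric without giving its formula.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_iterations(
    seed: Any,
    describe: Callable[[Any], Any],   # hypothetical: VLLM turning a sample into a description
    generate: Callable[[Any], Any],   # hypothetical: generator turning a description into a sample
    embed: Callable[[Any], Vector],   # hypothetical: encoder mapping a sample to a vector
    num_iterations: int,
) -> List[float]:
    """Alternate description and generation, recording similarity to the seed each round."""
    seed_vector = embed(seed)
    current = seed
    similarities: List[float] = []
    for _ in range(num_iterations):
        description = describe(current)
        current = generate(description)
        similarities.append(cosine_similarity(seed_vector, embed(current)))
    return similarities


def drift_score(similarities: Sequence[float]) -> float:
    """Assumed scoring rule: average of similarities weighted by iteration index t = 1..T.

    Later iterations get larger weights, so a model that preserves the seed's
    semantics for longer scores higher.
    """
    if not similarities:
        raise ValueError("need at least one iteration")
    weights = range(1, len(similarities) + 1)
    weighted_sum = sum(t * s for t, s in zip(weights, similarities))
    return weighted_sum / sum(weights)
```

Under these assumptions, a higher score means less semantic drift from the seed, which the abstract treats as inversely related to the model's tendency to hallucinate. In practice the three callables would wrap real models; here they are left abstract so the loop structure stays visible.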
