CV CL

Nov 2, 2023

GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks

Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, Linda Ruth Petzold

arXiv:2311.01361v131.2136 citations

h-index: 19

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of human-aligned evaluation for vision-language tasks, offering a potential universal automatic evaluator, though it is incremental in leveraging an existing model for a new application.

The paper tackled the challenge of automatically evaluating vision-language tasks by systematically exploring GPT-4V as a generalist evaluator, showing promising agreement with human judgments across various tasks and methods.

Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments, because of limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multi-modal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment. We employ two evaluation methods with GPT-4V: single-answer grading and pairwise comparison. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators. Despite limitations such as restricted visual clarity grading and real-world complex reasoning, its ability to provide human-aligned scores enriched with detailed explanations is promising for a universal automatic evaluator.
