CL, AI, CV | Dec 19, 2024

Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

arXiv:2412.14613v2 | 2 citations | h-index: 19
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of adapting evaluation metrics to multi-task scenarios for vision-language models. The advance is incremental because it builds on existing VLM evaluation methods.

The paper tackles the problem of evaluating text generated by vision-language models across multiple tasks and criteria. It proposes HarmonicEval, a reference-free metric that aggregates criterion-wise scores, and constructs the MMHE dataset of 18,000 expert judgments. The results show that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.

