IR, CL, CV, MM · Aug 2, 2024

Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation

arXiv:2408.01363v1 · 7 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of automating relevance judgments in retrieval for multimedia content creation, but the advance is incremental because it builds on existing VLMs.

The paper assessed Vision-Language Models (VLMs) like CLIP, LLaVA, and GPT-4V for automatic relevance judgment in image-text retrieval, finding that LLaVA and GPT-4V achieved Kendall's τ ~0.4 and GPT-4V had a Cohen's κ of ~0.08, outperforming CLIPScore.

Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's τ ~ 0.4 when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's κ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
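
For readers unfamiliar with the two agreement statistics named in the abstract, the following is a minimal sketch of how Kendall's τ (here the tie-aware τ-b variant) and Cohen's κ can be computed between human and model relevance labels. The function names, the τ-b choice, and the toy labels are assumptions made for illustration; they are not taken from the paper and may differ from its exact evaluation protocol.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1  # the lists order this pair in opposite ways
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labelled independently
    # according to their own label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:
        return float("nan")  # undefined when both raters use a single shared label
    return (observed - expected) / (1 - expected)


# Hypothetical graded relevance labels (0 = not relevant, 2 = highly relevant).
human = [2, 0, 1, 2, 0, 1]
model = [2, 1, 1, 2, 0, 0]
print("Kendall tau-b:", kendall_tau_b(human, model))
print("Cohen kappa:  ", cohen_kappa(human, model))
```

The two statistics answer different questions: τ depends only on how the two sets of scores order the items, while κ requires the labels themselves to match. This is consistent with the abstract reporting a moderate τ (about 0.4) alongside a low κ (about 0.08) for the same model.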

