CV · AI · CL · LG · Sep 14, 2024

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

arXiv:2409.09269v3 · 24 citations · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

The paper provides a framework for selecting VLMs in practical VQA applications, though the advance is incremental because it builds on existing benchmarks and metrics.

The paper tackles the challenge of evaluating Vision-Language Models (VLMs) for Visual Question-Answering (VQA) by introducing VQA360, a dataset annotated with task types, domains, and knowledge types, and GoEval, a metric with a 56.71% correlation to human judgments, finding that no single VLM excels universally across tasks.

Visual Question-Answering (VQA) has become key to user experience, particularly since the generalization capabilities of Vision-Language Models (VLMs) improved. But evaluating VLMs against an application's requirements with a standardized framework in practical settings is still challenging. This paper aims to solve that with an end-to-end framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types for comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, which makes choosing the right model a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths while providing additional advantages. Our framework can also be extended to other tasks.
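
The selection idea in the abstract can be illustrated with a minimal sketch. It assumes per-sample scores that have already been tagged with a task type, domain, and knowledge type, averages them per model for a given requirement, and returns the top model. The record layout, the model names (`model_a`, `model_b`), the tag values, the scores, and the helpers `mean_scores` and `best_model` are all hypothetical placeholders; none of them come from the paper or its code.

```python
from collections import defaultdict

# Hypothetical records: (model, task_type, domain, knowledge_type, score).
# Names and numbers are illustrative placeholders, NOT results from the paper.
RECORDS = [
    ("model_a", "chart", "finance", "mathematical", 0.75),
    ("model_a", "document", "medical", "commonsense", 0.25),
    ("model_b", "chart", "finance", "mathematical", 0.5),
    ("model_b", "document", "medical", "commonsense", 1.0),
]

# Tag names in the order they appear inside each record.
FIELDS = ("task_type", "domain", "knowledge_type")


def mean_scores(records, **requirement):
    """Return the mean score per model over records matching every given tag."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, sample count]
    for model, *tags, score in records:
        tagged = dict(zip(FIELDS, tags))
        if all(tagged[key] == value for key, value in requirement.items()):
            totals[model][0] += score
            totals[model][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


def best_model(records, **requirement):
    """Return the model with the highest mean score, or None if nothing matches."""
    scores = mean_scores(records, **requirement)
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(best_model(RECORDS))                      # best over all records
    print(best_model(RECORDS, task_type="chart"))   # best for chart questions
    print(best_model(RECORDS, domain="medical"))    # best for the medical domain
```

With these placeholder numbers the overall winner and the winner for chart questions differ, which is the situation the abstract describes: the right choice depends on the requirement being filtered for.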

Code Implementations: 1 repo
