CV · AI · Dec 17, 2023

An Evaluation of GPT-4V and Gemini in Online VQA

arXiv:2312.10637v2 · 8 citations · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work provides a fine-grained analysis of LMM capabilities for researchers, but it is incremental as it focuses on evaluation rather than model improvement.

The researchers evaluated GPT-4V and Gemini on a new visual question answering dataset from an online community, identifying challenging question types like 'puzzling' topics and 'Sheet Music' images.

While there is much excitement about the potential of large multimodal models (LMMs), a comprehensive evaluation is critical to establish their true capabilities and limitations. In support of this aim, we evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset sourced from an authentic online question answering community. We conduct fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions, such as image type and the required image processing capabilities. Our zero-shot performance analysis highlights the types of questions that are most challenging for both models, including questions on the "puzzling" topic, questions with the "Identification" user intention, questions with the "Sheet Music" image type, and questions labeled as "hard" by GPT-4.
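The fine-grained analysis described above amounts to grouping per-question scores by a metadata field and looking for the lowest-scoring groups. A minimal sketch of that idea follows. The record layout, the `score` field, the "Photo" category, and the toy values are all assumptions made for illustration; they are not the paper's data, metric, or code.

```python
from collections import defaultdict


def mean_score_by_category(records, field):
    """Average the per-question score within each value of a metadata field."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [score sum, count]
    for rec in records:
        bucket = totals[rec[field]]
        bucket[0] += rec["score"]
        bucket[1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}


def hardest_categories(records, field, k=1):
    """Return the k categories with the lowest mean score."""
    means = mean_score_by_category(records, field)
    return sorted(means, key=means.get)[:k]


# Toy records, invented for illustration only (hypothetical field names and scores).
toy_records = [
    {"image_type": "Sheet Music", "score": 0.0},
    {"image_type": "Sheet Music", "score": 1.0},
    {"image_type": "Photo", "score": 1.0},
    {"image_type": "Photo", "score": 1.0},
]

print(mean_score_by_category(toy_records, "image_type"))
print(hardest_categories(toy_records, "image_type"))
```

The same two functions could be applied to any of the metadata fields (topic, user intention, image type, difficulty), one field at a time, to compare where each model struggles.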
