CV | Aug 4, 2025

Evaluating Variance in Visual Question Answering Benchmarks

arXiv:2508.02645v1 | 1 citation | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This work addresses unreliable evaluation practices for MLLMs on VQA benchmarks, an issue that matters to researchers and developers who rely on benchmark scores for robust model development. Its contribution, proposing variance-aware methodologies, is incremental.

This paper tackles performance variance in the evaluation of multimodal large language models (MLLMs) on visual question answering (VQA) benchmarks. It finds that factors such as training seed and model scale cause significant variability across 14 benchmarks, and it explores Cloze-style evaluation as a strategy to reduce stochasticity.

Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment strategy, studying its effectiveness in reducing stochasticity and improving reliability across benchmarks. Our findings highlight the limitations of current evaluation practices and advocate for variance-aware methodologies to foster more robust and reliable development of MLLMs.
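To make the contrast between a point estimate and a variance-aware report concrete, here is a minimal Python sketch. It is not the paper's method or code: the function names, the normal-approximation interval, and the accuracy values are all illustrative assumptions, and the numbers are hypothetical rather than results from the paper.

```python
import math
import statistics


def summarize_runs(scores, z=1.96):
    """Return (mean, sample std, approximate 95% CI half-width) across runs.

    Assumption: each entry in `scores` is the benchmark accuracy of one
    training seed. The interval uses a normal approximation, which is
    crude for a handful of seeds; a t-interval or bootstrap may fit better.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(scores))
    return mean, std, half_width


def intervals_overlap(scores_a, scores_b):
    """True if the approximate confidence intervals of two models overlap."""
    mean_a, _, hw_a = summarize_runs(scores_a)
    mean_b, _, hw_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) <= hw_a + hw_b


# Hypothetical accuracies (%) for two models, five training seeds each.
model_a = [61.2, 62.0, 60.5, 61.8, 61.0]
model_b = [61.9, 62.4, 61.1, 62.6, 61.5]

for name, runs in (("model_a", model_a), ("model_b", model_b)):
    mean, std, hw = summarize_runs(runs)
    print(f"{name}: {mean:.2f} +/- {hw:.2f} (std {std:.2f}, n={len(runs)})")

print("intervals overlap:", intervals_overlap(model_a, model_b))
```

In this hypothetical case a single-seed comparison could rank either model first depending on which seed was drawn, whereas the overlapping intervals indicate that the gap is within seed-to-seed noise.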

