CV · AI · Oct 11, 2021

Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking

arXiv:2110.05159v1664 citations
Originality: Synthesis-oriented
AI Analysis

The paper gives VQA researchers and challenge organizers a single consolidated tool for assessing model performance in realistic scenarios. The contribution is incremental, since it packages existing evaluation needs rather than introducing a new evaluation method.

The authors tackle the problem of evaluating Visual Question Answering (VQA) systems beyond single-metric accuracy. They propose a browser-based benchmarking tool that tests generalization across datasets and robustness to input noise, and that reports bias and uncertainty metrics. A case study on four models shows that state-of-the-art models fail to generalize; for example, they cannot recognize text in images.

On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test generalization capabilities of models across multiple datasets, evaluating not just accuracy, but also performance in more realistic scenarios, such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty, to further explain model behavior. Interactive filtering facilitates discovery of problematic behavior, down to the data sample level. As proof of concept, we perform a case study on four models. We find that state-of-the-art VQA models are optimized for specific tasks or datasets, but fail to generalize even to other in-domain test sets; for example, they cannot recognize text in images. Our metrics allow us to quantify which image and question embeddings provide the most robustness to a model. All code is publicly available.
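To make the idea of evaluating "beyond accuracy" concrete, here is a minimal sketch of reporting per-dataset accuracy alongside a simple noise-robustness score. It does not use the tool's actual API or the paper's metric definitions. Every name in it (`accuracy`, `noise_robustness`, `benchmark`, `toy_model`, `TOY_DATASETS`) is hypothetical, and robustness is assumed here to mean the fraction of predictions that stay unchanged when Gaussian noise is added to the image features.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample layout: (image features, question, ground-truth answer).
Sample = Tuple[List[float], str, str]
# Hypothetical model interface: maps (image features, question) to an answer.
Model = Callable[[List[float], str], str]


def accuracy(model: Model, samples: Sequence[Sample]) -> float:
    """Fraction of samples answered exactly right."""
    if not samples:
        return 0.0
    correct = sum(model(img, q) == ans for img, q, ans in samples)
    return correct / len(samples)


def add_gaussian_noise(image: List[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Return a copy of the image features with zero-mean Gaussian noise."""
    return [x + rng.gauss(0.0, sigma) for x in image]


def noise_robustness(model: Model, samples: Sequence[Sample],
                     sigma: float, seed: int = 0) -> float:
    """Assumed definition: share of predictions unchanged under input noise."""
    if not samples:
        return 0.0
    rng = random.Random(seed)
    unchanged = sum(
        model(img, q) == model(add_gaussian_noise(img, sigma, rng), q)
        for img, q, _ in samples
    )
    return unchanged / len(samples)


def benchmark(model: Model, datasets: Dict[str, Sequence[Sample]],
              sigma: float) -> Dict[str, Dict[str, float]]:
    """Evaluate one model on several datasets with more than one metric."""
    return {
        name: {
            "accuracy": accuracy(model, samples),
            "noise_robustness": noise_robustness(model, samples, sigma),
        }
        for name, samples in datasets.items()
    }


def toy_model(image: List[float], question: str) -> str:
    """Stand-in model: answers yes/no from mean brightness, ignores the question."""
    return "yes" if sum(image) / len(image) > 0.5 else "no"


# Two tiny made-up datasets: one the toy model suits, one it cannot handle.
TOY_DATASETS: Dict[str, Sequence[Sample]] = {
    "yes_no": [([0.9, 0.8], "Is it bright?", "yes"),
               ([0.1, 0.2], "Is it bright?", "no")],
    "text_reading": [([0.9, 0.9], "What does the sign say?", "stop")],
}

if __name__ == "__main__":
    for name, scores in benchmark(toy_model, TOY_DATASETS, sigma=0.3).items():
        print(name, scores)
```

The toy model is accurate on the yes/no set but scores zero on the text-reading set, which mirrors in miniature the kind of generalization gap a single leaderboard number would hide.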

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
