CV | AI | LG | Dec 27, 2021

Multi-Image Visual Question Answering

arXiv:2112.13706v2
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific gap in VQA for multi-image scenarios, but it is incremental, since it builds on existing methods such as ResNet and BERT.

The paper tackles Visual Question Answering with multiple image inputs by proposing a new dataset and benchmarking models on it, reaching 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.

While a lot of work has been done on developing models for Visual Question Answering, the ability of these models to relate the question to the image features remains less explored. We present an empirical study of different feature extraction methods combined with different loss functions. We propose a new dataset for Visual Question Answering with multiple image inputs and only one ground truth, and benchmark our results on it. Our final model, which uses ResNet + RCNN image features and BERT embeddings and is inspired by the stacked attention network, gives 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.
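To make the stacked-attention idea mentioned in the abstract concrete, here is a minimal NumPy sketch of attention hops over several candidate images. It is not the paper's implementation: the random vectors stand in for ResNet/RCNN image features and a BERT question embedding, the sizes `m`, `d`, `k` and the function `attention_hop` are hypothetical, and reading the predicted image off the final attention weights is an assumption made for illustration. The hop itself follows the usual stacked attention formulation (tanh projection, softmax over candidates, query refined by the attended feature).

```python
import numpy as np


def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_hop(image_feats, query, W_i, W_q, b_q, w_p, b_p):
    """One attention hop over m candidate feature vectors.

    image_feats: (m, d) array, one vector per image (stand-in features).
    query:       (d,) question embedding, or the refined query from the
                 previous hop.
    Returns the attention weights (m,) and the refined query (d,).
    """
    # Project features and query into a shared k-dimensional space.
    h = np.tanh(image_feats @ W_i + (query @ W_q + b_q))  # (m, k)
    # One attention score per candidate, normalised with softmax.
    p = softmax(h @ w_p + b_p)                            # (m,)
    # Attention-weighted sum of features, added back to the query.
    attended = p @ image_feats                            # (d,)
    return p, attended + query


# Hypothetical sizes: 4 candidate images, 8-dim features, 16-dim hidden.
rng = np.random.default_rng(0)
m, d, k = 4, 8, 16
image_feats = rng.normal(size=(m, d))  # placeholder for image features
question = rng.normal(size=d)          # placeholder for question embedding

query = question
for _ in range(2):  # two stacked hops, each with its own random parameters
    W_i = rng.normal(scale=0.1, size=(d, k))
    W_q = rng.normal(scale=0.1, size=(d, k))
    b_q = np.zeros(k)
    w_p = rng.normal(scale=0.1, size=k)
    b_p = 0.0
    p, query = attention_hop(image_feats, query, W_i, W_q, b_q, w_p, b_p)

# Assumption: the image with the highest final attention weight is taken
# as the one the question refers to.
predicted_image = int(np.argmax(p))
print("attention over images:", np.round(p, 3))
print("predicted image index:", predicted_image)
```

With trained rather than random parameters, the refined `query` would typically feed an answer classifier, while the attention weights indicate which of the input images the model relied on.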

Code Implementations: 1 repo