Simple Baseline for Visual Question Answering
This work gives researchers in visual question answering a simple baseline to compare against and shows that basic methods can be effective.
The authors propose a simple bag-of-words baseline for visual question answering that concatenates question word features with CNN image features, achieving performance comparable to recent RNN-based methods on the VQA dataset.
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
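The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the released code: the function names, the toy vocabulary, the two-dimensional stand-in for CNN image features, and the hand-set weights are all assumptions made for illustration, and the sketch uses raw word counts and a single linear layer followed by a softmax where a trained model would learn its parameters from data.

```python
import math

def bag_of_words(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    counts = [0.0] * len(vocab)
    for token in question.lower().replace("?", " ").split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(question, image_features, vocab, weights, biases, answers):
    """Concatenate question and image features, then score each candidate answer."""
    features = bag_of_words(question, vocab) + list(image_features)
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs

# Toy setup (all values hypothetical, chosen by hand rather than learned).
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
answers = ["black", "two"]
image_features = [0.5, -1.0]  # stand-in for a CNN feature vector
weights = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # scores the answer "black"
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # scores the answer "two"
]
biases = [0.0, 0.0]

answer, probs = predict(
    "What color is the cat?", image_features, vocab, weights, biases, answers
)
```

In this toy setup the word "color" and the first image feature both raise the score of "black", so `answer` is `"black"`. A real system would replace the hand-set weights with parameters trained on question-image-answer triples and the stand-in vector with features extracted by a CNN.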