Learning to Count Objects in Natural Images for Visual Question Answering
This paper addresses a specific bottleneck in VQA models, offering researchers and practitioners an incremental but effective way to improve counting accuracy.
The paper tackles object counting in natural images for Visual Question Answering (VQA). It identifies soft attention as a fundamental cause of poor counting and proposes a neural network component for robust counting from object proposals. The single model achieves state-of-the-art accuracy on the number category of the VQA v2 dataset and improves counting by 6.6% over a strong baseline on a balanced pair metric.
Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.
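The abstract does not spell out why soft attention hinders counting. One common way to see the issue, offered here as an illustrative assumption and not as the component proposed in the paper, is that softmax-normalized attention weights sum to one, so the attended feature is a weighted average. The minimal sketch below uses a hypothetical two-dimensional feature `CAT` and hypothetical attention scores to show that one object and two identical objects produce the same attended feature, which leaves a downstream classifier nothing to count from.

```python
import math

def softmax(scores):
    """Normalize raw attention scores so that the weights sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """Return the attention-weighted average of the proposal features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Hypothetical feature vector standing in for a "cat" object proposal.
CAT = [1.0, 0.0]

# One cat: the single proposal receives all of the attention weight.
one_cat = soft_attention([CAT], [5.0])

# Two cats with equal scores: each proposal receives weight 0.5.
two_cats = soft_attention([CAT, CAT], [5.0, 5.0])

# Both scenes yield the same attended feature, so the count is lost.
print(one_cat, two_cats, one_cat == two_cats)
```

Because the normalization divides the attention mass among the attended proposals, the number of proposals cancels out of the result. This is consistent with the abstract's motivation for counting directly from object proposals.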