AI | CL | CV | Dec 23, 2017

Interpretable Counting for Visual Question Answering

arXiv:1712.08697v2 | 75 citations
AI Analysis

This paper addresses a major challenge in VQA: counting objects in a way that is interpretable for downstream applications. The contribution is incremental, since it builds on existing detection and interaction methods.

The paper tackles the problem of counting objects in images for visual question answering by treating counting as a sequential decision process, where the model makes discrete choices to select objects, resulting in improved performance over state-of-the-art methods on multiple metrics.

Questions that require counting a variety of objects in images remain a major challenge in visual question answering (VQA). The most common approaches to VQA involve either classifying answers based on fixed-length representations of both the image and question or summing fractional counts estimated from each section of the image. In contrast, we treat counting as a sequential decision process and force our model to make discrete choices of what to count. Specifically, the model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections. A distinction of our approach is its intuitive and interpretable output, as discrete counts are automatically grounded in the image. Furthermore, our method outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
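
The sequential selection idea in the abstract can be illustrated with a small toy sketch. This is not the authors' model: the function name `sequential_count`, the hand-set scores, the pairwise `interaction` matrix, and the fixed `stop_score` threshold are all assumptions made for illustration. In the paper these quantities would be learned from the image and question. The sketch only shows how discrete, one-at-a-time selections with interaction updates yield a count that is grounded in specific detected objects.

```python
from typing import List


def sequential_count(
    scores: List[float],
    interaction: List[List[float]],
    stop_score: float = 0.0,
) -> List[int]:
    """Greedily select objects one at a time (toy illustration).

    scores[j]         -- hypothetical initial relevance of detected object j
    interaction[i][j] -- hypothetical change to object j's score once i is selected
    stop_score        -- selection stops when no remaining score exceeds this
    Returns the indices of selected objects; the count is the list length.
    """
    current = list(scores)  # copy so the caller's list is not modified
    remaining = set(range(len(current)))
    selected: List[int] = []

    while remaining:
        # Discrete choice: the highest-scoring object not yet selected.
        best = max(remaining, key=lambda j: current[j])
        if current[best] <= stop_score:
            break  # nothing left is worth counting
        selected.append(best)
        remaining.remove(best)
        # The selection influences the scores of all remaining objects.
        for j in remaining:
            current[j] += interaction[best][j]

    return selected


# Four detected boxes; boxes 0 and 1 overlap on the same object, so
# selecting either one strongly suppresses the other.
scores = [2.0, 1.8, 1.5, -1.0]
interaction = [
    [0.0, -5.0, 0.0, 0.0],
    [-5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
selected = sequential_count(scores, interaction)
count = len(selected)  # selected is [0, 2], so count is 2
```

Because the output is a list of object indices rather than a single number, each unit of the count can be traced back to a detected box, which is the sense in which the abstract calls the counts grounded in the image.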

