CV, May 31, 2019

Scene Text Visual Question Answering

arXiv:1905.13648v2, 491 citations
Originality: Synthesis-oriented
AI Analysis

This work fills a gap in existing VQA datasets by focusing on textual cues in images, though the contribution is incremental because it builds on existing VQA frameworks.

The authors address the neglect of scene text in visual question answering by introducing the ST-VQA dataset, in which answering a question requires reading text in the image. They also propose a new evaluation metric and baseline methods.

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
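The abstract does not spell out the evaluation metric. In the paper it is the Average Normalized Levenshtein Similarity (ANLS), which gives partial credit when a predicted answer is close to a ground-truth string (for example, off by one misread character) and zero credit once the normalized edit distance reaches a threshold of 0.5. The following is a minimal sketch under stated assumptions, not the official evaluation script: the function names are hypothetical, and the lowercasing and whitespace stripping are an assumed normalization.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def answer_score(prediction: str, ground_truths: Sequence[str],
                 tau: float = 0.5) -> float:
    """Best similarity of one prediction against all accepted answers."""
    # Assumed normalization: case-insensitive, surrounding whitespace ignored.
    pred = prediction.strip().lower()
    best = 0.0
    for gt in ground_truths:
        ref = gt.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        if nl < tau:  # distances at or above tau count as a wrong answer
            best = max(best, 1.0 - nl)
    return best


def anls(predictions: Sequence[str], ground_truths: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Mean per-question score over the whole evaluation set."""
    if not predictions:
        return 0.0
    scores = [answer_score(p, g, tau) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)
```

With this sketch, a prediction of "coca co1a" against the answer "coca cola" scores 8/9, since a single recognition error is penalized lightly, while an unrelated answer such as "pepsi" scores 0. The threshold is what separates recognition slips from reasoning errors.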

Code Implementations: 4 repos