CV · CL · Oct 6, 2020

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

arXiv:2010.02582v11000 citations
Originality: Incremental advance
AI Analysis

This work improves text VQA systems for applications requiring scene understanding and reasoning from text in images, representing an incremental advancement with novel method components.

The paper tackled the problem of text-based visual question answering (text VQA) by addressing the underuse of positional information and lack of evidence for generated answers, proposing a localization-aware answer prediction network (LaAP-Net) that outperformed existing approaches on three benchmark datasets by a noticeable margin.

Image text carries essential information to understand the scene and perform reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused, and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
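To make the output format in the abstract concrete, the sketch below shows one way an answer could be returned together with a bounding box as evidence. It is not LaAP-Net: the abstract does not describe the localization module, so this sketch swaps in the simplest possible stand-in, which is to reuse the box of the selected OCR token. The names `OcrToken`, `Prediction`, and `predict_with_evidence`, the `(x_min, y_min, x_max, y_max)` box convention, and the score inputs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class OcrToken:
    text: str
    box: Box


@dataclass
class Prediction:
    answer: str
    # None when the answer comes from the fixed vocabulary, which has no image location.
    evidence_box: Optional[Box]


def predict_with_evidence(
    vocab: List[str],
    vocab_scores: List[float],
    ocr_tokens: List[OcrToken],
    ocr_scores: List[float],
) -> Prediction:
    """Pick the highest-scoring candidate across the vocabulary and the OCR tokens.

    Assumption: the scores come from some upstream model. The evidence box is
    simply copied from the chosen OCR token rather than predicted.
    """
    if len(vocab) != len(vocab_scores) or len(ocr_tokens) != len(ocr_scores):
        raise ValueError("each candidate needs exactly one score")

    # Vocabulary words carry no box; OCR tokens carry the box they were read from.
    candidates = [(score, word, None) for score, word in zip(vocab_scores, vocab)]
    candidates += [(score, tok.text, tok.box) for score, tok in zip(ocr_scores, ocr_tokens)]
    if not candidates:
        raise ValueError("no candidates to choose from")

    _, answer, box = max(candidates, key=lambda c: c[0])
    return Prediction(answer=answer, evidence_box=box)
```

A caller would pass the fixed vocabulary and the OCR tokens detected in the image along with their scores. An answer drawn from the OCR tokens comes back with a box that can be drawn on the image, while a vocabulary answer comes back with no box.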

