CV · Jan 24, 2018

Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering

arXiv:1801.07853v1 · 16 citations
Originality: Incremental advance
AI Analysis

This work aims to improve image understanding and language-vision interaction, a problem of interest to researchers in AI and computer vision. It is an incremental advance: it builds on existing VQA methods and adds specific enhancements.

The paper tackles the visual question answering (VQA) multiple-choice task with a model that incorporates POS-tag guided attention, convolutional n-grams, triplet attention interactions, and structured learning for triplets. It reports state-of-the-art performance of 68.2% on Visual7W and a competitive 69.6% on VQA Real Multiple Choice.

Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
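The abstract names POS-tag guided attention and scoring over (image, question, candidate answer) triplets but does not give their formulation. The following is a minimal sketch of the general idea only, under stated assumptions: the per-tag scores in `TAG_SCORES`, the helper names `pos_guided_pool` and `pick_answer`, and the trilinear scoring function are all hypothetical illustrations, not the paper's model.

```python
import math

# Hypothetical per-tag relevance scores (assumption); a trained model
# would learn these rather than hard-code them.
TAG_SCORES = {"NOUN": 2.0, "VERB": 1.0, "ADJ": 1.0, "WH": 1.5,
              "DET": -1.0, "OTHER": 0.0}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pos_guided_pool(word_vectors, pos_tags, tag_scores=TAG_SCORES):
    """Pool word vectors with attention weights derived from POS tags.

    Each word's attention logit is looked up from its tag, so content
    words (e.g. nouns) contribute more than function words.
    """
    logits = [tag_scores.get(t, tag_scores["OTHER"]) for t in pos_tags]
    weights = softmax(logits)
    dim = len(word_vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, word_vectors))
              for i in range(dim)]
    return pooled, weights


def pick_answer(image_vec, question_vec, answer_vecs):
    """Score every (image, question, answer) triplet; return best index.

    The trilinear product is an illustrative stand-in for a learned
    triplet interaction, not the paper's attention mechanism.
    """
    scores = [sum(i * q * a for i, q, a in zip(image_vec, question_vec, ans))
              for ans in answer_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In this sketch, a question such as "the cat" tagged `["DET", "NOUN"]` would pool almost entirely toward the noun's vector, and the multiple-choice prediction is simply the candidate whose triplet score is highest.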

Code Implementations: 1 repo
