CV, AI, CL, MM · Aug 8, 2018

Question-Guided Hybrid Convolution for Visual Question Answering

arXiv:1808.02632v1 · 71 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of effectively integrating visual spatial information with textual features in VQA, offering an incremental improvement by complementing existing methods like bilinear pooling and attention.

The paper tackles the problem of capturing textual and visual relationships in Visual Question Answering (VQA) by proposing a Question-Guided Hybrid Convolution (QGHC) network. QGHC uses question-guided kernels and group convolution to generate discriminative multi-modal features with fewer parameters, and the reported gains are validated on public VQA datasets.

In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features, capturing the textual and visual relationship at an early stage. The question-guided convolution tightly couples the textual and visual information, but it also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method can further boost performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
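The parameter argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the question-guided kernels are predicted from the question vector by a single linear layer, and the channel sizes, question dimension, group count, and the half-and-half split between question-dependent and question-independent groups are all hypothetical values, not the configuration reported in the paper.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * (c_out // groups) * k * k * groups


def predictor_params(q_dim, n_predicted_weights):
    """Weights of one linear layer mapping a question vector to kernel weights
    (bias ignored). Assumption: kernels are predicted by a single linear layer."""
    return q_dim * n_predicted_weights


# Hypothetical sizes chosen for illustration only.
Q_DIM, C_IN, C_OUT, K, GROUPS = 2048, 512, 512, 1, 8

# 1) Predict every weight of an ordinary (ungrouped) convolution.
full = predictor_params(Q_DIM, conv_weights(C_IN, C_OUT, K))

# 2) Predict every weight of a grouped convolution.
grouped = predictor_params(Q_DIM, conv_weights(C_IN, C_OUT, K, GROUPS))

# 3) Hybrid: half of the groups are question-dependent (predicted), the other
#    half are question-independent (ordinary learned weights).
dep_groups = GROUPS // 2
per_group = conv_weights(C_IN // GROUPS, C_OUT // GROUPS, K)
hybrid = (predictor_params(Q_DIM, dep_groups * per_group)
          + (GROUPS - dep_groups) * per_group)

print(f"full:    {full:,}")     # 536,870,912
print(f"grouped: {grouped:,}")  # 67,108,864
print(f"hybrid:  {hybrid:,}")   # 33,570,816
```

Under these assumed sizes, grouping alone cuts the predictor by a factor equal to the number of groups, and making only some groups question-dependent reduces it further, since the question-independent groups cost only their own weights.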

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
