CVJun 10, 2018

Learning Answer Embeddings for Visual Question Answering

arXiv:1806.03724v138 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of open-ended visual question answering by improving model transferability across datasets with limited answer overlap, though it is incremental in its method.

The authors tackled the problem of visual question answering by proposing a probabilistic model that learns embeddings for images/questions and answers, enabling the handling of semantic relationships among answers and transfer learning to unseen answers. The approach performed well on both in-domain learning and transfer learning across several datasets.

We propose a novel probabilistic model for visual question answering (Visual QA). The key idea is to infer two sets of embeddings: one for the image and the question jointly and the other for the answers. The learning objective is to learn the best parameterization of those embeddings such that the correct answer has higher likelihood among all possible answers. In contrast to several existing approaches of treating Visual QA as multi-way classification, the proposed approach takes the semantic relationships (as characterized by the embeddings) among answers into consideration, instead of viewing them as independent ordinal numbers. Thus, the learned embedded function can be used to embed unseen answers (in the training dataset). These properties make the approach particularly appealing for transfer learning for open-ended Visual QA, where the source dataset on which the model is learned has limited overlapping with the target dataset in the space of answers. We have also developed large-scale optimization techniques for applying the model to datasets with a large number of answers, where the challenge is to properly normalize the proposed probabilistic models. We validate our approach on several Visual QA datasets and investigate its utility for transferring models across datasets. The empirical results have shown that the approach performs well not only on in-domain learning but also on transfer learning.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes