CVNov 12, 2018

Holistic Multi-modal Memory Network for Movie Question Answering

arXiv:1811.04595v120 citations
Originality Incremental advance
AI Analysis

This work addresses the problem of improving question answering accuracy in multi-modal contexts for AI applications, though it appears incremental as it builds on existing memory network approaches.

The paper tackles the challenge of multi-modal question answering by proposing the Holistic Multi-modal Memory Network (HMMN) framework, which fully integrates interactions between multi-modal context, question, and answer choices to achieve state-of-the-art accuracy on the MovieQA dataset.

Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and contributions of different attention strategies.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes