CV, Nov 12, 2018

Holistic Multi-modal Memory Network for Movie Question Answering

Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, Vijay Chandrasekhar

arXiv:1811.04595v1 | 6.8 | 20 citations

Originality: Incremental advance

AI Analysis

This work aims to improve question answering accuracy when the answer depends on multi-modal context. The advance appears incremental, since it builds on existing memory network approaches.

The paper tackles the challenge of multi-modal question answering by proposing the Holistic Multi-modal Memory Network (HMMN) framework, which fully integrates interactions between multi-modal context, question, and answer choices to achieve state-of-the-art accuracy on the MovieQA dataset.

Answering questions according to multi-modal context is a challenging problem, as it requires a deep integration of different data sources. Existing approaches employ only partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework, which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. The proposed framework therefore effectively integrates multi-modal context, question, and answer information, which leads to more informative context being retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on the MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and the contributions of different attention strategies.
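The abstract does not give the model's equations, so the following is only a rough sketch of the general idea it describes: attention hops over a memory of context slots, with each answer choice taking part in retrieval. It is not the HMMN architecture itself. The function names, the additive way the choice is folded into the query, the dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, memory):
    """One attention hop: weight each memory slot by its similarity to the query."""
    weights = softmax([dot(query, slot) for slot in memory])
    dim = len(query)
    return [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]


def pick_answer(question, choices, memory, hops=2):
    """Return the index of the choice best supported by the retrieved context.

    `memory` stands in for multi-modal context (for example subtitle and
    frame embeddings in one shared space); this layout is hypothetical.
    """
    if hops < 1:
        raise ValueError("hops must be at least 1")
    scores = []
    for choice in choices:
        # Assumption: the answer choice joins the query by vector addition,
        # so retrieval is conditioned on the choice as well as the question.
        query = [q + c for q, c in zip(question, choice)]
        for _ in range(hops):
            context = attend(query, memory)
            # Generic memory-network update: fold retrieved context into the query.
            query = [q + c for q, c in zip(query, context)]
        scores.append(dot(context, choice))
    return max(range(len(choices)), key=scores.__getitem__)


# Toy data: two memory slots, one question, two candidate answers.
memory = [[4.0, 0.0], [0.0, 4.0]]
question = [1.0, 0.0]
choices = [[1.0, 0.0], [0.0, 1.0]]
best = pick_answer(question, choices, memory)
```

In this toy setup the first choice aligns with the question, so attention concentrates on the first memory slot and that choice receives the higher score. A trained model would learn the embeddings and the interaction functions rather than use fixed vectors and plain addition.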
