Multimodal Dual Attention Memory for Video Story Question Answering
This work addresses video story question answering for applications in multimedia understanding, representing an incremental improvement over existing methods.
The authors tackled video story question answering by proposing the Multimodal Dual Attention Memory (MDAM) architecture, which uses a dual attention mechanism with late fusion to achieve new state-of-the-art results on PororoQA and MovieQA datasets with significant margins.
We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM uses the second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on PororoQA and MovieQA datasets which have large-scale QA annotations on cartoon videos and movies, respectively. For both datasets, MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models. We confirm the best performance of the dual attention mechanism combined with late fusion by ablation studies. We also perform qualitative analysis by visualizing the inference mechanisms of MDAM.