CL | Apr 25, 2020

MCQA: Multimodal Co-attention Based Network for Question Answering

arXiv:2004.12238v1 | 15 citations
Originality: Incremental advance
AI Analysis

The paper addresses multimodal question answering, but the advance is incremental: it builds on prior methods and reports moderate gains.

The paper tackles multimodal question answering by fusing and aligning text, audio, and video inputs with queries, achieving a 4-7% accuracy improvement on the Social-IQ benchmark dataset.

We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e. text, audio, and video), which forms the context for the query (question and answer). Our approach fuses and aligns the question and the answer within this context. Moreover, we use the notion of co-attention to perform cross-modal alignment and multimodal context-query alignment. Our context-query alignment module matches the relevant parts of the multimodal context and the query with each other and aligns them to improve the overall performance. We evaluate the performance of MCQA on Social-IQ, a benchmark dataset for multimodal question answering. We compare the performance of our algorithm with prior methods and observe an accuracy improvement of 4-7%.
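To make the idea of co-attention between a context and a query more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `co_attention`, the dot-product affinity, and the toy feature vectors are all assumptions chosen for illustration, and the abstract does not specify how MCQA computes its affinity scores or fuses the modalities. The sketch only shows the general pattern: build an affinity matrix between context steps and query tokens, normalize it in both directions, and use the weights to summarize each side in terms of the other.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def co_attention(context, query):
    """Generic co-attention sketch (hypothetical, not MCQA's exact module).

    context: n x d list of feature vectors (e.g. fused text/audio/video steps)
    query:   m x d list of feature vectors (e.g. question and answer tokens)
    """
    # Assumption: dot-product affinity between every context step and query token.
    affinity = matmul(context, transpose(query))  # n x m

    # Each context step attends over the query tokens (rows sum to 1).
    context_to_query = [softmax(row) for row in affinity]  # n x m
    # Each query token attends over the context steps (rows sum to 1).
    query_to_context = [softmax(row) for row in transpose(affinity)]  # m x n

    # Query summary aligned to each context step, and vice versa.
    attended_query = matmul(context_to_query, query)  # n x d
    attended_context = matmul(query_to_context, context)  # m x d
    return attended_query, attended_context, context_to_query, query_to_context


if __name__ == "__main__":
    # Toy features: 3 context steps and 2 query tokens, each of dimension 2.
    toy_context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_query = [[1.0, 0.0], [0.0, 2.0]]
    aq, ac, c2q, q2c = co_attention(toy_context, toy_query)
    print("context-to-query weights:", c2q)
    print("query-to-context weights:", q2c)
```

In a learned model the affinity would typically involve trainable projections, and the attended representations would feed later layers; those details are left out here because the abstract does not describe them.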

