MM, Nov 18, 2021

Triple Attention Network architecture for MovieQA

arXiv:2111.09531v1
Originality: Incremental advance
AI Analysis

This work addresses movie question answering for multimedia AI by incrementally extending dual-attention methods to include audio.

The paper tackles the MovieQA task by incorporating audio into a triple-attention network, which improves performance by about 7% relative to using only visual features.

Movie question answering, or MovieQA, is a multimedia task in which one is given a video, its subtitles, a question, and a set of candidate answers. The task is to predict the correct answer to the question using the components of the multimedia, namely video/images, audio, and text. Traditionally, MovieQA uses only the image and text components. In this paper, we propose a novel network with a triple-attention architecture that brings audio into the MovieQA task. The architecture is modeled on a traditional dual-attention network that attends only to video and text. Experiments show that including audio through the triple-attention network provides complementary information for MovieQA that is not captured by the visual or textual components of the data. Experiments with a wide range of audio features show that such a network can improve MovieQA performance by about 7% relative to using only visual features.
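The abstract does not spell out the network's layers, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes question-guided soft attention over each of the three modalities, an additive fusion of the three attended contexts with the question, and dot-product scoring of candidate answers. All function names, the fusion rule, and the scoring rule are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Soft attention: weight each feature vector by the softmax of its
    dot product with the query, then return the weighted sum."""
    weights = softmax([dot(query, f) for f in features])
    context = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(query))
    ]
    return context, weights


def triple_attention_predict(question, video, audio, subtitles, answers):
    """Hypothetical triple-attention pipeline (an assumption, not the
    paper's exact design): attend to each modality with the question,
    add the three contexts to the question, and score each candidate
    answer by its dot product with the fused vector."""
    contexts = [attend(question, modality)[0]
                for modality in (video, audio, subtitles)]
    fused = [q + sum(c[i] for c in contexts)
             for i, q in enumerate(question)]
    scores = [dot(fused, a) for a in answers]
    best = max(range(len(answers)), key=scores.__getitem__)
    return best, scores
```

In this sketch every input is a list of equal-length embedding vectors (one per video frame, audio segment, or subtitle line), and dropping the audio argument from the modality tuple would reduce it to a dual-attention baseline of the kind the abstract mentions.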

