CV · AI · Aug 11, 2021

Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering

arXiv:2108.05158v1 · 1 citation
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more flexible, inferential models in video question answering. It is incremental, however, because it adapts an existing dataset and an existing model.

The paper addresses the limitations of multiple-choice video question answering by converting the task to an open-ended format. It fine-tunes a pretrained GPT2 model on video inputs and subtitles, and an ablation study on the adapted DramaQA dataset shows that incorporating video metadata improves performance.

Video question answering has recently received a lot of attention from multimodal video researchers. Most video question answering datasets are in multiple-choice form. However, a model for the multiple-choice task does not infer the answer; rather, it compares the answer candidates to pick the correct one. This also makes it difficult to extend the model to other tasks. In this paper, we challenge the existing multiple-choice video question answering by changing it to open-ended video question answering. To tackle open-ended question answering, we use the pretrained GPT2 model. The model is fine-tuned with video inputs and subtitles. An ablation study is performed by changing the existing DramaQA dataset to an open-ended question answering dataset, and it shows that performance can be improved using video metadata.
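The conversion described in the abstract can be illustrated with a minimal, text-only sketch. It is not the authors' pipeline: the field names (`question`, `candidates`, `correct_idx`, `subtitles`, `metadata`), the prompt layout, and the example metadata string are all assumptions made for illustration, and the video features the paper feeds to the model are left out. The sketch only shows how a multiple-choice item might be turned into an open-ended one by keeping the correct answer as the generation target, and how a `use_metadata` switch could support an ablation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    # Hypothetical field names; the real DramaQA schema may differ.
    question: str
    candidates: List[str]
    correct_idx: int
    subtitles: List[str]
    metadata: List[str]  # assumed to be short text annotations of the clip


def to_open_ended(item: MultipleChoiceItem) -> dict:
    """Drop the distractors and keep only the correct answer as the target."""
    return {
        "question": item.question,
        "answer": item.candidates[item.correct_idx],
        "subtitles": item.subtitles,
        "metadata": item.metadata,
    }


def build_prompt(example: dict, use_metadata: bool = True) -> str:
    """Flatten the context into one text prompt for a generative language model.

    The layout is an assumption; `use_metadata=False` gives the ablated input.
    """
    parts = []
    if use_metadata and example["metadata"]:
        parts.append("metadata: " + ", ".join(example["metadata"]))
    if example["subtitles"]:
        parts.append("subtitles: " + " ".join(example["subtitles"]))
    parts.append("question: " + example["question"])
    parts.append("answer:")
    return "\n".join(parts)
```

In a setup like this, the prompt followed by the `answer` string would form one training sequence for a causal language model such as GPT2, and at inference time the model would generate the text after `answer:` instead of scoring a fixed set of candidates.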

