CV CL MM | May 30, 2022

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

arXiv:2205.14895v1 | 25.674 citations | h-index: 32 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the limited reasoning capabilities of current video understanding methods. Its contribution is incremental: it builds on existing VideoQA methods, and its main addition is a new dataset.

The paper introduces Causal-VidQA, a dataset for video question-answering that addresses evidence and commonsense reasoning, revealing that state-of-the-art methods perform well on scene descriptions but struggle with reasoning tasks.

Video understanding has achieved great success in representation learning, in tasks such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding that extends to video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that state-of-the-art methods are strong in description but weak in reasoning. We hope that Causal-VidQA can guide research on video understanding from representation learning to deeper reasoning. The dataset and related resources are available at https://github.com/bcmi/Causal-VidQA.git.
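
The four question types and the two-step setup for commonsense questions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class and field names are hypothetical, the multiple-choice format (answers and reasons selected by index) is assumed, and the strict rule that a commonsense question counts as correct only when both the answer and the reason are right is one plausible reading of the two-step description, not the confirmed evaluation protocol of the paper or its repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QuestionType(Enum):
    DESCRIPTION = "description"        # scene description
    EXPLANATION = "explanation"        # evidence reasoning
    PREDICTION = "prediction"          # commonsense reasoning
    COUNTERFACTUAL = "counterfactual"  # commonsense reasoning


# Question types that use the two-step setup (answer plus reason).
COMMONSENSE_TYPES = {QuestionType.PREDICTION, QuestionType.COUNTERFACTUAL}


@dataclass
class QAItem:
    """Hypothetical record layout; not the dataset's actual file format."""
    question: str
    qtype: QuestionType
    answer_idx: int                   # index of the correct answer (assumed multiple choice)
    reason_idx: Optional[int] = None  # index of the correct reason, commonsense types only


def is_correct(item: QAItem, pred_answer: int, pred_reason: Optional[int] = None) -> bool:
    """Assumed strict scoring: commonsense items need both answer and reason right."""
    answer_ok = pred_answer == item.answer_idx
    if item.qtype in COMMONSENSE_TYPES:
        return answer_ok and pred_reason == item.reason_idx
    return answer_ok
```

Under this assumed rule, a model that picks the right prediction but the wrong reason gets no credit for that question, which would make the commonsense categories harder to score well on than description or explanation.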

