CV, Nov 29, 2021

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

arXiv:2111.14547v2, 19 citations
AI Analysis

This work addresses the problem of multi-modal video understanding for AI systems, presenting an incremental improvement in representation integration for VideoQA.

The authors address Video Question Answering by proposing LiVLR, a lightweight framework that integrates multi-grained visual and linguistic representations with a Diversity-aware Visual-Linguistic Reasoning module (DaVL), and report performance advantages on the MSRVTT-QA and KnowIT VQA benchmarks.

Video Question Answering (VideoQA), which aims to correctly answer a given question based on an understanding of multi-modal video content, is challenging because video content is so rich. From the perspective of video understanding, a good VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first uses graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations. The obtained representations are then integrated by the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL accounts for the differences among the types of representations and can flexibly adjust the importance of each type when generating the question-related joint representation, making it an effective and general representation integration method. The proposed LiVLR is lightweight and shows its performance advantage on two VideoQA benchmarks, MSRVTT-QA and KnowIT VQA. Extensive ablation studies demonstrate the effectiveness of LiVLR's key components.
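
The abstract does not spell out how DaVL computes the importance of each representation type, so the following is only a minimal sketch of the general idea of question-conditioned weighting, not the authors' implementation. It swaps in plain scaled dot-product scoring followed by a softmax, and the representation names (`visual_object`, `visual_frame`, `linguistic`), the function `fuse`, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(question, representations):
    """Weight each representation type by its relevance to the question.

    Assumption: relevance is a scaled dot product with the question vector.
    This is a generic attention-style stand-in, not the DaVL module itself.
    """
    dim = len(question)
    names = list(representations)
    scores = [
        sum(q * r for q, r in zip(question, representations[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Joint representation: weighted sum of the per-type vectors.
    joint = [
        sum(w * representations[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), joint


# Toy inputs; the names and values are illustrative only.
question = [1.0, 0.0, 0.0, 0.0]
representations = {
    "visual_object": [2.0, 0.0, 0.0, 0.0],
    "visual_frame": [0.0, 2.0, 0.0, 0.0],
    "linguistic": [1.0, 1.0, 0.0, 0.0],
}
weights, joint = fuse(question, representations)
```

In this toy setup the representation most aligned with the question vector receives the largest weight, which is the behaviour the abstract describes qualitatively as adjusting the importance of different representation types.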

