CV · CL · Oct 11, 2022

Learning to Locate Visual Answer in Video Corpus Using Question

arXiv:2210.05423v4 · 8 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of video understanding and retrieval for instructional content, though the advance is incremental: it builds on existing multimodal and video localization techniques.

The paper tackles the problem of locating visual answers in a large collection of untrimmed instructional videos using natural language questions, introducing the VCVAL task and proposing a cross-modal contrastive global-span method that outperforms other methods on the MedVidCQA dataset.

We introduce a new task, video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed instructional videos using a natural language question. The task requires a range of skills: interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks with the global-span matrix. We have reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods in both the video corpus retrieval and visual answer localization subtasks. We also perform detailed analyses of extensive experiments, opening a path for further research on understanding instructional videos.
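The abstract does not spell out the CCGS training objective, so the following is only a minimal sketch of a generic symmetric cross-modal contrastive (InfoNCE-style) loss of the kind commonly used for question-to-video retrieval. It is not the authors' implementation. The function names, the `temperature` value, and the assumption that question `i` is paired with video `i` in the batch are all hypothetical, chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def log_softmax_at(row, idx):
    """Numerically stable log-softmax of `row`, evaluated at position `idx`."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - lse


def contrastive_loss(question_vecs, video_vecs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of (question, video) pairs.

    Assumption: question_vecs[i] and video_vecs[i] form a matching pair;
    every other video in the batch acts as a negative for question i.
    """
    q = [l2_normalize(v) for v in question_vecs]
    k = [l2_normalize(v) for v in video_vecs]
    n = len(q)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[dot(q[i], k[j]) / temperature for j in range(n)] for i in range(n)]
    # Question-to-video direction: each row should peak on its diagonal entry.
    q2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Video-to-question direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    v2q = -sum(log_softmax_at(sim_t[j], j) for j in range(n)) / n
    return 0.5 * (q2v + v2q)


if __name__ == "__main__":
    questions = [[1.0, 0.0], [0.0, 1.0]]
    matched_videos = [[1.0, 0.0], [0.0, 1.0]]
    mismatched_videos = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", contrastive_loss(questions, matched_videos))
    print("mismatched:", contrastive_loss(questions, mismatched_videos))
```

In this toy setup, correctly paired embeddings yield a much smaller loss than swapped ones, which is the signal a retrieval subtask would train on. How the paper couples this kind of objective with span localization through the global-span matrix is not described in the abstract and is not modeled here.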

Code Implementations: 1 repo