CV, Jun 25, 2021

Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair

arXiv:2106.13566v1, 4 citations
Originality: Incremental advance
AI Analysis

This work addresses a specific challenge in video retrieval for applications like search and analysis, but it is incremental as it builds on existing methods by refining training with linguistic analysis.

The paper tackles text-based video moment retrieval, focusing on the many-to-many correspondences between text annotations and video moments. It proposes a training method that incorporates potentially relevant pairs, and reports quantitative and qualitative gains on two benchmark datasets.

In this paper, we undertake the task of text-based video moment retrieval from a corpus of videos. To train a model for this task, text-moment paired datasets are used to learn the correct correspondences. In typical training methods, ground-truth text-moment pairs are used as positive pairs, whereas all other pairs are regarded as negative pairs. However, aside from the ground-truth pairs, some text-moment pairs should be regarded as positive: one text annotation can be positive for many video moments, and conversely, one video moment can correspond to many text annotations. Thus, there are many-to-many correspondences between the text annotations and video moments. Based on these correspondences, we can form potentially relevant pairs, which are not given as ground truth yet are not negative; effectively incorporating such pairs into training can improve retrieval performance. A text query should describe what is happening in a video moment. Hence, different video moments annotated with similar texts that describe a similar action are likely to contain that action, so these pairs can be considered potentially relevant pairs. In this paper, we propose a novel training method that takes advantage of potentially relevant pairs, which are detected through linguistic analysis of the text annotations. Experiments on two benchmark datasets show that our method improves retrieval performance both quantitatively and qualitatively.
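
To make the idea of potentially relevant pairs concrete, here is a minimal Python sketch. It is not the paper's method: the abstract does not specify how the linguistic analysis or the training objective works. The sketch assumes a crude word-overlap test (the hypothetical helpers `content_words` and `potentially_relevant`) as a stand-in for linguistic analysis, and one simple way to use the result, namely dropping potentially relevant pairs from the negatives of a contrastive loss (the hypothetical `masked_contrastive_loss`). The stopword list, threshold, and temperature are illustrative assumptions.

```python
import math

# Assumption: a tiny stopword list standing in for real linguistic analysis.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "with"}


def content_words(text):
    """Lowercase the tokens, strip basic punctuation, and drop stopwords."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return {t for t in tokens if t and t not in STOPWORDS}


def potentially_relevant(texts, threshold=0.5):
    """Return index pairs (i, j), i != j, whose annotations overlap strongly.

    Annotation i belongs to moment i, so (i, j) means text i may also
    describe moment j. Overlap is Jaccard similarity of content words.
    """
    words = [content_words(t) for t in texts]
    pairs = set()
    for i in range(len(texts)):
        for j in range(len(texts)):
            if i == j:
                continue
            union = words[i] | words[j]
            if union and len(words[i] & words[j]) / len(union) >= threshold:
                pairs.add((i, j))
    return pairs


def masked_contrastive_loss(sim, relevant, temperature=0.1):
    """Text-to-moment softmax loss that skips potentially relevant negatives.

    sim[i][j] is the similarity of text i and moment j; the diagonal holds
    the ground-truth pairs. Pairs in `relevant` are left out of the
    denominator, so they are not pushed apart as negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [sim[i][j] / temperature
                  for j in range(n) if j == i or (i, j) not in relevant]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - sim[i][i] / temperature
    return total / n


texts = ["A person opens the door.",
         "The person opens a door.",
         "A man is cooking dinner."]
pairs = potentially_relevant(texts)
sim = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9]]
print(sorted(pairs))
print(masked_contrastive_loss(sim, pairs), masked_contrastive_loss(sim, set()))
```

In this toy batch the first two annotations describe the same action, so the sketch flags them as potentially relevant to each other's moments and stops penalizing their high similarity. Ignoring such pairs is only one option; the abstract speaks of incorporating them into training, which could also mean treating them as weak positives.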

