CV · Nov 21, 2022

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

arXiv:2211.11351v1 · 7 citations · h-index: 37 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses cross-modal retrieval for video search applications, presenting an incremental improvement in feature combination methods.

The paper tackles text-to-video retrieval by investigating how best to combine textual and visual features into multiple joint spaces, and it documents the resulting performance through experiments on three large-scale datasets (IACC.3, V3C1, and MSR-VTT).

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
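The retrieval idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`feature_pairs`, `summed_similarity`, `softmax_revise`, the toy feature labels and vectors) is hypothetical. It assumes that every textual feature is paired with every visual feature, that each pair has its own joint space in which queries and videos are already embedded, that per-space cosine similarities are summed, and that the softmax revision reweights each score by a softmax taken over all queries for the same video. The abstract does not specify these details, so treat them as assumptions.

```python
import math
from itertools import product


def feature_pairs(text_features, visual_features):
    """Pair every textual feature with every visual feature (assumed: one joint space per pair)."""
    return list(product(text_features, visual_features))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summed_similarity(spaces):
    """spaces: list of (query_vectors, video_vectors), one entry per joint space.

    Returns an n_queries x n_videos matrix of cosine similarities summed over all spaces.
    """
    n_q, n_v = len(spaces[0][0]), len(spaces[0][1])
    sims = [[0.0] * n_v for _ in range(n_q)]
    for queries, videos in spaces:
        for i, q in enumerate(queries):
            for j, v in enumerate(videos):
                sims[i][j] += cosine(q, v)
    return sims


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_revise(sims):
    """Assumed revision step: scale each score by a softmax over all queries for that video."""
    n_q, n_v = len(sims), len(sims[0])
    revised = [[0.0] * n_v for _ in range(n_q)]
    for j in range(n_v):
        weights = softmax([sims[i][j] for i in range(n_q)])
        for i in range(n_q):
            revised[i][j] = sims[i][j] * weights[i]
    return revised


# Toy usage: two hypothetical text features and one visual feature give two joint spaces.
pairs = feature_pairs(["text_a", "text_b"], ["visual_a"])
toy_spaces = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),  # space for pairs[0]
    ([[1.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),  # space for pairs[1]
]
scores = softmax_revise(summed_similarity(toy_spaces))
# Rank videos for each query by descending revised score.
ranking = [sorted(range(len(row)), key=lambda j: -row[j]) for row in scores]
```

In this sketch the revision step lowers the score of a video that is also similar to many other queries, so a video that matches one query distinctively rises in that query's ranking. The repository linked above is the reference for the actual procedure.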

Code Implementations: 1 repo