CV · Apr 16, 2020

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

arXiv:2004.07967v1 · 1 citation
AI Analysis

This paper addresses the problem of improving video retrieval accuracy by moving beyond a single embedding space, though the contribution is incremental.

The paper tackles the challenge of matching video dynamics to textual features in video retrieval by proposing a multiple embedding space framework, achieving performance competitive with state-of-the-art methods on a benchmark dataset.

Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate various videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We propose to produce a final similarity between instances by fusing similarities measured in each embedding space using a weighted sum strategy. We determine the weights according to a sentence. Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrated the effectiveness of the proposed multiple embedding approach compared to existing methods.
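
The weighted-sum fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The names `cosine_similarity`, `softmax`, and `fused_similarity` are hypothetical, cosine similarity is assumed as the per-space measure, and the sentence-dependent weights are assumed to come from a softmax over per-space scores (`weight_logits`), which the abstract does not specify.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between rows of a (m, d) and rows of b (n, d)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fused_similarity(sentence_embs, video_embs, weight_logits):
    """Fuse per-space similarities with sentence-dependent weights.

    sentence_embs: list of K arrays, each of shape (d_k,), one per space.
    video_embs:    list of K arrays, each of shape (n_videos, d_k).
    weight_logits: array of shape (K,). Assumed here to be produced from
                   the query sentence by some learned module (hypothetical).
    Returns an array of shape (n_videos,) with the final similarities.
    """
    weights = softmax(np.asarray(weight_logits, dtype=float))
    # Similarity of the sentence to every video, measured in each space.
    per_space = np.stack([
        cosine_similarity(s[None, :], v)[0]
        for s, v in zip(sentence_embs, video_embs)
    ])  # shape (K, n_videos)
    # Weighted sum over the K embedding spaces.
    return weights @ per_space
```

Ranking the videos by the returned scores (for example with `np.argsort(-scores)`) gives the retrieval order. Raising one entry of `weight_logits` emphasizes the corresponding embedding space for that query, which can change which video ranks first.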

