CV · Apr 23, 2016

Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

arXiv:1604.06838v2 · 35 citations
Originality: Incremental advance
AI Analysis

This paper addresses cross-modal retrieval for images and videos, offering an incremental improvement by matching sentences to visual content in the visual feature space alone rather than in a learned joint subspace.

The paper tackles the problem of matching images and videos to descriptive sentences by proposing Word2VisualVec, a deep neural network that predicts visual features from text, achieving state-of-the-art results on four benchmarks.

This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. We thoroughly analyze its architectural design, by varying the sentence vectorization strategy, network depth and the deep feature to predict for image to sentence matching. We also generalize Word2VisualVec for matching a video to a sentence, by extending the predictive abilities to 3-D ConvNet features as well as a visual-audio representation. Experiments on four challenging image and video benchmarks detail Word2VisualVec's properties, capabilities for image and video to sentence matching, and on all datasets its state-of-the-art results.
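The pipeline the abstract describes (vectorize a sentence, map it through a multi-layer perceptron into a visual feature space, then match there) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words vectorizer, the toy vocabulary, the layer sizes, the random untrained weights, the linear output layer, the mean squared error objective, and all function names below are assumptions chosen only to keep the example self-contained.

```python
import math
import random


def bag_of_words(sentence, vocab):
    """Vectorize a sentence as word counts over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for token in sentence.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def init_layer(n_in, n_out, rng):
    """Random weights stand in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def mlp_forward(x, layers):
    """ReLU on hidden layers, linear output (an assumption of this sketch)."""
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < last:
            x = [max(0.0, v) for v in x]
    return x


def mse(pred, target):
    """Mean squared error between a predicted and a real visual feature."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_sentences(visual_feature, sentences, vocab, layers):
    """Score each sentence by similarity in the visual space, best first."""
    scored = [(cosine(visual_feature,
                      mlp_forward(bag_of_words(s, vocab), layers)), s)
              for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy setup: 5-word vocabulary, 8 hidden units, 4-D "visual" space.
vocab = {"a": 0, "dog": 1, "runs": 2, "cat": 3, "sleeps": 4}
rng = random.Random(0)
layers = [init_layer(len(vocab), 8, rng), init_layer(8, 4, rng)]

# Placeholder for a ConvNet feature of an image or video (made-up values).
image_feature = [0.2, 0.0, 0.9, 0.4]
sentences = ["a dog runs", "a cat sleeps", "a dog sleeps"]
ranked = rank_sentences(image_feature, sentences, vocab, layers)
for score, sentence in ranked:
    print(f"{score:+.3f}  {sentence}")
```

Because the weights are untrained, the ranking itself is meaningless; the sketch only shows the data flow. In a trained model, the parameters would be fitted so that the predicted vector for a sentence lies close to the visual feature of the image or video it describes, for example by minimizing `mse` over sentence and feature pairs.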

