CV · Oct 7, 2022

Learning to embed semantic similarity for joint image-text retrieval

arXiv:2210.03838v1 · 12 citations · h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the challenge of cross-modal retrieval for applications in multimedia and AI, but it appears incremental as it builds on existing metric learning approaches.

The authors tackled the problem of joint image-text retrieval by learning semantic embeddings in Euclidean space, achieving favorable performance compared to state-of-the-art methods on datasets like MS-COCO, Flickr30K, and Flickr8K.

We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. To this end, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss, which is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches.
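To make the two loss terms named in the abstract concrete, here is a minimal sketch in plain Python of a margin-based hinge loss over L2 distances and a center loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-D vectors, and the fixed `margin` values are all hypothetical. The abstract does not say how the margin is adapted during training, so the margin is simply passed in as a parameter. The differentiable quantization step is not covered.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge_triplet_loss(anchor, positive, negative, margin):
    """Margin-based hinge loss over L2 distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive. `margin` is fixed here; the
    adaptive refinement mentioned in the abstract is not reproduced.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, margin + gap)


def center_loss(embeddings, center):
    """Half the mean squared L2 distance of embeddings to a shared center."""
    total = sum(l2_distance(e, center) ** 2 for e in embeddings)
    return 0.5 * total / len(embeddings)


# Toy 2-D embeddings (illustrative values, not taken from the paper).
image = [0.0, 0.0]
matching_caption = [3.0, 4.0]    # L2 distance 5 from the image
mismatched_caption = [6.0, 8.0]  # L2 distance 10 from the image

# A small margin is already satisfied; a larger one is violated by 2.
loss_small_margin = hinge_triplet_loss(
    image, matching_caption, mismatched_caption, margin=2.0
)
loss_large_margin = hinge_triplet_loss(
    image, matching_caption, mismatched_caption, margin=7.0
)

# Center loss pulls the image and its caption toward one concept center.
concept_center_loss = center_loss([image, matching_caption], center=[1.5, 2.0])
```

With these toy values, the mismatched caption is 5 units farther from the image than the matching one, so a margin of 2 yields zero loss while a margin of 7 yields a loss of 2. This shows why the choice of margin (and hence a scheme for adapting it) determines which triplets still contribute to training.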

