CV · Nov 22, 2017

Conditional Image-Text Embedding Networks

arXiv:1711.08389v4 · 125 citations
Originality: Incremental advance
AI Analysis

This work addresses phrase grounding in images for computer vision applications, offering an incremental improvement over existing methods.

The paper tackles the problem of grounding phrases in images by jointly learning multiple text-conditioned embeddings in an end-to-end model, achieving improvements of 4%, 3%, and 4% in grounding performance across three datasets compared to a baseline.

This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain improvements of 4%, 3%, and 4%, respectively, in grounding performance over a strong region-phrase embedding baseline.
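The data flow the abstract describes (shared layers, then concept-specific layers, mixed by a concept weight branch) can be sketched as a toy forward pass. This is a minimal illustration, not the authors' implementation. The class name `ConditionalEmbeddingSketch`, the layer sizes, the random weights, the element-wise product of normalized features, the softmax over concepts, and the reduction of each concept-specific layer to a single score are all assumptions made for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(x):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def random_matrix(rng, rows, cols):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ConditionalEmbeddingSketch:
    """Hypothetical toy model: shared projection, K concept scorers, weight branch."""

    def __init__(self, region_dim, phrase_dim, shared_dim, num_concepts, seed=0):
        rng = random.Random(seed)
        # Shared layers: project regions and phrases into one common space.
        self.W_region = random_matrix(rng, shared_dim, region_dim)
        self.W_phrase = random_matrix(rng, shared_dim, phrase_dim)
        # Concept-specific layers (assumed here: one scoring vector per concept).
        self.W_concept = random_matrix(rng, num_concepts, shared_dim)
        # Concept weight branch: looks at the phrase only.
        self.W_weight = random_matrix(rng, num_concepts, phrase_dim)

    def concept_weights(self, phrase):
        """Soft assignment of a phrase to the K embeddings (sums to 1)."""
        return softmax(matvec(self.W_weight, phrase))

    def score(self, region, phrase):
        """Region-phrase compatibility: concept scores mixed by the phrase's weights."""
        r = l2_normalize(matvec(self.W_region, region))
        p = l2_normalize(matvec(self.W_phrase, phrase))
        joint = [a * b for a, b in zip(r, p)]  # shared joint representation
        per_concept = matvec(self.W_concept, joint)
        weights = self.concept_weights(phrase)
        return sum(w * s for w, s in zip(weights, per_concept))

    def ground(self, regions, phrase):
        """Return the index of the best-scoring region and all scores."""
        scores = [self.score(region, phrase) for region in regions]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores


# Usage with made-up feature vectors: 5 candidate regions, 1 phrase.
rng = random.Random(1)
model = ConditionalEmbeddingSketch(region_dim=8, phrase_dim=6, shared_dim=4, num_concepts=3)
regions = [[rng.random() for _ in range(8)] for _ in range(5)]
phrase = [rng.random() for _ in range(6)]
best, scores = model.ground(regions, phrase)
```

In this sketch the assignment of a phrase to an embedding is soft and computed from the phrase itself, which mirrors the abstract's contrast with prior works that predefine the assignment. A trained model would learn all four weight matrices jointly from region-phrase pairs, which the sketch does not attempt.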

Code Implementations: 1 repo