CV · CL · LG · Dec 1, 2019

Learning to Relate from Captions and Bounding Boxes

arXiv:1912.00311v11090 citations
Originality: Incremental advance
AI Analysis

This paper addresses relation prediction in computer vision with minimal annotation. Its contribution is incremental: it builds on existing weak supervision methods.

The paper tackles the problem of predicting relationships between entities in images using only captions and bounding boxes as weak supervision, achieving a recall@50 of 15% and recall@100 of 25% on the Visual Genome dataset.
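The recall@K figures quoted above can be read with the help of a small sketch. It is a simplified illustration, not the paper's evaluation code: it assumes each prediction is a scored (subject, predicate, object) triplet over already-matched object identifiers, and it skips the box-overlap matching that a full Visual Genome evaluation would need. The function name `recall_at_k` and the sample triplets are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the k highest-scoring predictions.

    predictions:  list of (score, (subject, predicate, object)) tuples
    ground_truth: set of (subject, predicate, object) tuples
    Assumption: subjects and objects are already matched to ground-truth
    boxes, so a triplet counts as a hit on exact equality.
    """
    if not ground_truth:
        return 0.0
    # Keep the k most confident predictions.
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    # Use a set so that a repeated prediction is not counted twice.
    hits = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(hits) / len(ground_truth)


# Hypothetical example data for one image.
predictions = [
    (0.9, ("man", "riding", "horse")),
    (0.8, ("man", "wearing", "hat")),
    (0.3, ("horse", "on", "grass")),
    (0.1, ("dog", "near", "man")),
]
ground_truth = {("man", "riding", "horse"), ("horse", "on", "grass")}

print(recall_at_k(predictions, ground_truth, k=2))
print(recall_at_k(predictions, ground_truth, k=3))
```

Under this definition, a recall@50 of 15% means that, on average, 15% of an image's annotated relationships appear among the model's 50 most confident predictions.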

In this work, we propose a novel approach that predicts the relationships between various entities in an image in a weakly supervised manner by relying on image captions and object bounding box annotations as the sole source of supervision. Our proposed approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset by achieving a recall@50 of 15% and recall@100 of 25% on the relationships present in the image. We also show that the model successfully predicts relations that are not present in the corresponding captions.
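The two alignment steps in the abstract can be illustrated with a toy sketch. It is not the authors' model: it assumes dot-product attention between a caption entity embedding and per-box feature vectors, and it assumes the caption has already been parsed into (subject, predicate, object) triples. The names `align_entity` and `align_relations` and all vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_entity(entity_vec, box_vecs):
    """Attend from one caption entity to all boxes; return (best box index, weights).

    Assumption: similarity is a plain dot product between embeddings.
    """
    scores = [sum(e * b for e, b in zip(entity_vec, box)) for box in box_vecs]
    weights = softmax(scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights


def align_relations(caption_triples, entity_vecs, box_vecs):
    """Turn parsed caption triples into (subject box, predicate, object box) labels.

    caption_triples: list of (subject word, predicate, object word)
    entity_vecs:     dict mapping each entity word to its embedding
    These labels could then serve as weak supervision for a relation classifier.
    """
    labels = []
    for subj, pred, obj in caption_triples:
        subj_box, _ = align_entity(entity_vecs[subj], box_vecs)
        obj_box, _ = align_entity(entity_vecs[obj], box_vecs)
        labels.append((subj_box, pred, obj_box))
    return labels


# Hypothetical 2-d embeddings: box 0 looks like a horse, box 1 like a man.
box_vecs = [[0.1, 0.9], [2.0, 0.0]]
entity_vecs = {"man": [1.0, 0.0], "horse": [0.0, 1.0]}
labels = align_relations([("man", "riding", "horse")], entity_vecs, box_vecs)
print(labels)
```

The design point this sketch tries to capture is that no relationship annotations are required: the predicate label comes from the caption's syntax, and the box pair comes from attention.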

