CV, CL, LG
Aug 16, 2021

Who's Waldo? Linking People Across Text and Images

arXiv:2108.07253v2
21 citations
AI Analysis

The paper addresses the need for contextual models in vision-language tasks by focusing on linking people rather than objects. The contribution is incremental in that it builds on existing visual grounding work.

The paper tackles the problem of linking people named in captions to the people pictured in images. It introduces a person-centric visual grounding task and benchmark dataset, and proposes a Transformer-based method that outperforms several strong baselines.

We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
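The name-masking step described in the abstract can be illustrated with a minimal sketch. It assumes the character spans of person names are already known (for example, from a named-entity tagger) and that names are replaced with a placeholder token. The `mask_names` helper, the `[NAME]` token, and the example caption are all hypothetical and are not taken from the authors' released code.

```python
from typing import List, Tuple


def mask_names(caption: str, spans: List[Tuple[int, int]], token: str = "[NAME]") -> str:
    """Replace each (start, end) character span in `caption` with `token`.

    Hypothetical helper for illustration; spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("name spans must not overlap")
        pieces.append(caption[cursor:start])  # text before the name
        pieces.append(token)                  # placeholder instead of the name
        cursor = end
    pieces.append(caption[cursor:])           # text after the last name
    return "".join(pieces)


# Made-up caption and names, used only to demonstrate the masking.
caption = "Alice Smith shakes hands with Bob Jones."
names = ["Alice Smith", "Bob Jones"]
spans = [(caption.index(n), caption.index(n) + len(n)) for n in names]
masked = mask_names(caption, spans)
print(masked)
```

With the names hidden, a model can only decide which pictured person each placeholder refers to from context, such as who is shaking whose hand, and not from a memorized association between a name and an appearance.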

Code Implementations: 1 repo