CLCVApr 16, 2019

Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

arXiv:1904.07826v21005 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of linking images and text in web documents for applications like content analysis, but it is incremental as it builds on existing multimodal methods.

The paper tackled the problem of discovering image-sentence relationships in multimodal documents without explicit annotations, and found that a structured training objective based on co-occurrence in documents can predict specific links, as tested on seven datasets of varying difficulty.

Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images captioned post hoc by crowdworkers to naturally-occurring user-generated multimodal documents. We find that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes