CV · Jun 21, 2022

Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching

arXiv:2206.10436v1 · 7 citations · h-index: 33
Originality: Synthesis-oriented
AI Analysis

This work addresses the management and accessibility of multimedia content in large online encyclopedias such as Wikipedia. Its contribution is incremental: it applies existing Transformer methods to a specific challenge.

The paper tackles the problem of matching images to captions in Wikipedia by proposing a cascade of two Transformer-based models, which achieves a normalized Discounted Cumulative Gain (nDCG) of 0.53 on the private leaderboard of a Kaggle challenge.
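For readers unfamiliar with the metric, the following is a minimal sketch of the textbook nDCG at a cutoff k with linear gain. The cutoff value, the binary relevance grades, and the helper names `dcg_at_k` and `ndcg_at_k` are assumptions made for illustration; the source does not state how the challenge computed its score.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.

    `relevances` holds the relevance grade of each item in ranked order.
    The item at 0-based position `pos` is discounted by log2(pos + 2).
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item at all: define the score as 0
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical query with a single correct caption, ranked second.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

With exactly one relevant caption per query, this reduces to 1 / log2(1 + rank) when the correct caption appears within the top k, and 0 otherwise.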

With the increased accessibility of the web and online encyclopedias, the amount of data to manage is constantly increasing. In Wikipedia, for example, there are millions of pages written in multiple languages. These pages contain images that often lack textual context, so they remain conceptually floating and are harder to find and manage. In this work, we present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use data associated with images (URLs and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both powered by the recent Transformer model, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity bounded at inference time. Our approach obtains a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
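The idea of a two-model cascade that keeps inference cost bounded can be illustrated with a generic propose-then-re-rank sketch. This is not the authors' implementation: it assumes the first stage is a cosine-similarity search over precomputed embeddings and the second stage is an arbitrary, more expensive pairwise scorer. The function names `propose` and `rerank`, the toy scorer, and the random vectors standing in for Transformer embeddings are all hypothetical.

```python
import numpy as np


def propose(query_vec, caption_vecs, top_k):
    """Stage 1 (assumed): cheap cosine-similarity search over every caption.

    Returns the indices of the top_k most similar captions.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = c @ q
    top_k = min(top_k, scores.shape[0])
    return np.argsort(-scores)[:top_k]


def rerank(query, candidates, pair_score):
    """Stage 2 (assumed): run a costly pairwise scorer on the candidates only.

    `pair_score(query, caption_index)` returns a relevance score, higher is
    better. Returns the candidate indices sorted by that score.
    """
    scored = [(int(i), float(pair_score(query, int(i)))) for i in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [index for index, _ in scored]


# Toy data: random vectors stand in for learned caption embeddings.
rng = np.random.default_rng(0)
caption_vecs = rng.normal(size=(1000, 16))
# Hypothetical query whose correct caption is index 42.
query_vec = caption_vecs[42] + 0.1 * rng.normal(size=16)


def toy_pair_score(query, caption_index):
    """Placeholder for an expensive pairwise model: negative L2 distance."""
    return -np.linalg.norm(query - caption_vecs[caption_index])


candidates = propose(query_vec, caption_vecs, top_k=50)
ranking = rerank(query_vec, candidates, toy_pair_score)
```

In this sketch the expensive scorer runs on only 50 of the 1,000 captions per query, so its cost depends on `top_k` rather than on the size of the caption pool.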
