CV · Oct 8, 2020

Visual News: Benchmark and Challenges in News Image Captioning

arXiv:2010.03743v3 · 687 citations
Originality: Incremental advance
AI Analysis

This work addresses the generation of informative captions for news images, a task that matters for media and journalism. The advance is incremental: it builds on the existing Transformer architecture and adds new multi-modal fusion techniques.

The authors tackle news image captioning with two contributions: Visual News, a large-scale benchmark of more than one million news images with associated metadata, and Visual News Captioner, an entity-aware model that generates captions with richer event and entity information. The model achieves slightly better results than competing methods while using fewer parameters.

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method utilizes much fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.

Code Implementations: 1 repo
