CV · Sep 8, 2020

Towards Unique and Informative Captioning of Images

arXiv:2009.03949v1 · 38 citations
Originality: Incremental advance
AI Analysis

The paper addresses the problem of uninformative captions for users who rely on automated image descriptions. The advance is incremental because it builds on existing metrics and models.

The paper tackles generic and inaccurate captions from image captioning models. It analyzes existing systems and metrics and reveals flaws in both. It then proposes a new metric, SPICE-U, that correlates better with human judgments than SPICE, along with a mutual-information re-ranking technique that improves three state-of-the-art models on SPICE-U.

Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be 'topped' using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model -- by using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as average score over existing metrics.
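
The re-ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so this assumes a common pointwise mutual information form, score(c) = log p(c | image) - lam * log p(c), applied to candidates that a decoder has already produced. The function name, the `lam` weight, the `prior_logprob` callback, and all numbers are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import Callable, List, Tuple


def rerank_by_mutual_information(
    candidates: List[Tuple[str, float]],
    prior_logprob: Callable[[str], float],
    lam: float = 1.0,
) -> List[Tuple[str, float]]:
    """Re-rank (caption, log p(caption | image)) pairs.

    Assumed objective (not taken from the paper):
        score = log p(c | image) - lam * log p(c)
    A caption that is likely for every image has a high prior log p(c),
    so it is penalized relative to a more image-specific caption.
    """
    scored = [
        (caption, cond_lp - lam * prior_logprob(caption))
        for caption, cond_lp in candidates
    ]
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy candidates with made-up conditional log-likelihoods.
beam = [
    ("a man standing in a field", -4.0),
    ("a man flying a red kite in a field", -5.5),
]

# Made-up image-independent log-probabilities standing in for a language model.
toy_prior = {
    "a man standing in a field": -3.0,
    "a man flying a red kite in a field": -9.0,
}

ranked = rerank_by_mutual_information(beam, toy_prior.__getitem__, lam=1.0)
best_caption = ranked[0][0]
print(best_caption)
```

With `lam=0.0` the ranking reduces to plain likelihood and the generic caption wins; with `lam=1.0` the more specific caption is preferred because its prior probability is much lower.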

Code Implementations: 1 repo