CV | AI | CL | LG | Feb 2, 2023

IC3: Image Captioning by Committee Consensus

arXiv:2302.01328v3 | 138 citations | h-index: 76 | Has Code
AI Analysis

This paper addresses the problem of impoverished captions in image captioning, which matters for applications like accessibility and visual search, and it offers a new approach rather than an incremental improvement.

The paper tackles the problem of generating informationally rich image captions. It proposes IC3, which produces a single caption that combines multiple annotator viewpoints. Human raters find IC3 captions at least as helpful as SOTA baselines more than two-thirds of the time, and IC3 improves the performance of automated recall systems by up to 84%.

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 at least as helpful as baseline SOTA models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions, and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/
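The core idea in the abstract, one caption built from several viewpoints, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked code for that). It assumes a two-step split: draw several candidate captions, then merge them into one. The names `caption_by_committee`, `toy_sampler`, and `toy_summarizer` are hypothetical, and the toy stand-ins only mimic what a captioning model and a summarization model would do.

```python
from typing import Callable, List


def caption_by_committee(
    sample_caption: Callable[[int], str],
    summarize: Callable[[List[str]], str],
    num_candidates: int = 5,
) -> str:
    """Combine several candidate captions into a single caption.

    Both callables are hypothetical stand-ins: `sample_caption` plays the
    role of a captioning model queried several times, and `summarize`
    plays the role of whatever merges the candidates.
    """
    # Step 1 (assumed): collect a "committee" of candidate captions.
    candidates = [sample_caption(i) for i in range(num_candidates)]
    # Step 2 (assumed): merge the candidates into one consensus caption.
    return summarize(candidates)


# Toy stand-ins so the sketch runs without any model.
_TOY_VIEWS = ["a dog on the grass", "a brown dog playing", "a dog on the grass"]


def toy_sampler(i: int) -> str:
    # Deterministically cycles through fixed "annotator viewpoints".
    return _TOY_VIEWS[i % len(_TOY_VIEWS)]


def toy_summarizer(candidates: List[str]) -> str:
    # Keeps each distinct caption once, in order of first appearance.
    unique = list(dict.fromkeys(candidates))
    return "; ".join(unique)


consensus = caption_by_committee(toy_sampler, toy_summarizer, num_candidates=3)
print(consensus)
```

In a real system the sampler and summarizer would be learned models; passing them in as callables keeps the sketch independent of any specific library.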

Code Implementations: 1 repo
