CV · Jul 20, 2022

Explicit Image Caption Editing

arXiv:2207.09625v1 · 15 citations · h-index: 34
Originality: Incremental advance
AI Analysis

This paper addresses the need for explainable and efficient caption editing in computer vision. The advance is incremental: it builds on existing implicit models by adding explicit editing capabilities.

The paper tackles image caption editing by introducing Explicit Caption Editing (ECE), a task in which the model generates a sequence of edit operations that turns a reference caption into a refined one. It proposes TIger, a non-autoregressive transformer-based model, and reports ablations on two new benchmarks, COCO-EE and Flickr30K-EE, to demonstrate its effectiveness.

Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly produce the refined captions without explicit connections to the reference captions. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can translate the reference caption into a refined one. Compared to implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing Efficient: it only needs to modify a few words. 3) Human-like: it resembles the way that humans perform caption editing, and tries to keep original sentence structures. To solve this new task, we propose the first ECE model: TIger. TIger is a non-autoregressive transformer-based model, consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved or not, Tagger_add decides where to add new words, and Inserter predicts the specific word to add. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks have demonstrated the effectiveness of TIger.
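
The division of labor among the three modules can be illustrated with a small sketch. This is not the authors' implementation: there is no image input and no learned model here, and the tags and inserted words are supplied by hand where TIger would predict them. The tag names (`KEEP`, `DELETE`, `ADD`), the `[MASK]` placeholder, the convention that `ADD` opens a slot after the tagged word, and all function names are assumptions made for illustration only.

```python
from typing import List

# Hypothetical tag vocabulary and placeholder token (assumptions, not from the paper).
KEEP, DELETE, ADD = "KEEP", "DELETE", "ADD"
MASK = "[MASK]"


def apply_deletions(words: List[str], del_tags: List[str]) -> List[str]:
    """Tagger_del step: keep only the words tagged KEEP."""
    if len(words) != len(del_tags):
        raise ValueError("need exactly one deletion tag per word")
    return [w for w, t in zip(words, del_tags) if t == KEEP]


def apply_additions(words: List[str], add_tags: List[str]) -> List[str]:
    """Tagger_add step: open a [MASK] slot after every word tagged ADD."""
    if len(words) != len(add_tags):
        raise ValueError("need exactly one addition tag per word")
    out: List[str] = []
    for w, t in zip(words, add_tags):
        out.append(w)
        if t == ADD:
            out.append(MASK)
    return out


def fill_masks(words: List[str], new_words: List[str]) -> List[str]:
    """Inserter step: replace each [MASK] slot, left to right, with a word."""
    if words.count(MASK) != len(new_words):
        raise ValueError("need exactly one new word per [MASK] slot")
    remaining = iter(new_words)
    return [next(remaining) if w == MASK else w for w in words]


def edit_caption(
    reference: List[str],
    del_tags: List[str],
    add_tags: List[str],
    new_words: List[str],
) -> List[str]:
    """Run the three steps in order; add_tags align with the post-deletion words."""
    kept = apply_deletions(reference, del_tags)
    slotted = apply_additions(kept, add_tags)
    return fill_masks(slotted, new_words)


if __name__ == "__main__":
    reference = "a man riding a horse on the beach".split()
    del_tags = [KEEP, DELETE, KEEP, KEEP, KEEP, KEEP, KEEP, KEEP]
    add_tags = [ADD, KEEP, KEEP, KEEP, KEEP, KEEP, KEEP]
    print(" ".join(edit_caption(reference, del_tags, add_tags, ["woman"])))
```

In this toy case only one word is deleted and one is added, so the edit path (delete "man", open a slot after "a", insert "woman") is fully traceable and the rest of the sentence structure is left untouched, which is the behavior the abstract attributes to explicit editing.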

Code Implementations: 1 repo
