CL, CV, LG, NE | Nov 30, 2020

Language-Driven Region Pointer Advancement for Controllable Image Captioning

arXiv:2011.14901v1992 citations
AI Analysis

This work provides improved control over image captioning for end-users by better aligning region descriptions with natural language structure, representing an incremental improvement in the field.

This paper addresses controllable image captioning, in which the generated caption must describe specified image regions. The authors propose a novel method for predicting the timing of region pointer advancement. Its predicted timing matches the ground-truth timing on the Flickr30k Entities test data with 86.55% precision and 97.92% recall, and the resulting model improves on the state of the art on standard captioning metrics.
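To make the precision and recall figures concrete, the sketch below shows one plausible way to score advancement timing. It assumes, hypothetically, that each caption's advancement points are recorded as a set of token positions and that a predicted position counts as correct only when it matches a ground-truth position exactly. The function name, the position sets, and the exact-match criterion are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Set, Tuple


def timing_precision_recall(predicted: Set[int], reference: Set[int]) -> Tuple[float, float]:
    """Score predicted pointer-advancement positions against reference positions.

    Assumption: a prediction is correct only if it falls on exactly the same
    token position as a reference advancement.
    """
    true_positives = len(predicted & reference)
    # Precision: share of predicted advancements that are also in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference advancements that were predicted.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up positions for a single caption (not data from the paper).
predicted_steps = {2, 5, 7, 9}
reference_steps = {2, 5, 9}
precision, recall = timing_precision_recall(predicted_steps, reference_steps)
```

In this toy case the model advances once more than the reference does, which lowers precision while leaving recall at its maximum. That is the same direction as the reported gap, where recall exceeds precision.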

Controllable Image Captioning is a recent sub-field in the multi-modal task of Image Captioning wherein constraints are placed on which regions in an image should be described in the generated natural language caption. This puts a stronger focus on producing more detailed descriptions, and opens the door for more end-user control over results. A vital component of the Controllable Image Captioning architecture is the mechanism that decides the timing of attending to each region through the advancement of a region pointer. In this paper, we propose a novel method for predicting the timing of region pointer advancement by treating the advancement step as a natural part of the language structure via a NEXT-token, motivated by a strong correlation to the sentence structure in the training data. We find that our timing agrees with the ground-truth timing in the Flickr30k Entities test data with a precision of 86.55% and a recall of 97.92%. Our model implementing this technique improves the state-of-the-art on standard captioning metrics while additionally demonstrating a considerably larger effective vocabulary size.
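The idea of treating pointer advancement as part of the language can be illustrated with a small sketch. It assumes a decoder has already produced a token sequence in which a special NEXT-token marks each advancement; the token spelling `<NEXT>`, the region names, and the example caption are all hypothetical. The code only replays the pointer bookkeeping and is not the authors' model.

```python
from typing import List, Tuple

NEXT = "<NEXT>"  # hypothetical spelling of the NEXT-token


def align_tokens_to_regions(
    tokens: List[str], regions: List[str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Replay a generated token sequence and track the region pointer.

    Every NEXT-token moves the pointer to the following region and is dropped
    from the visible caption. All other tokens are paired with the region the
    pointer currently selects.
    """
    pointer = 0
    words: List[str] = []
    alignment: List[Tuple[str, str]] = []
    for token in tokens:
        if token == NEXT:
            # Advance, but never past the last region (an assumption made here
            # to keep the sketch safe on malformed sequences).
            pointer = min(pointer + 1, len(regions) - 1)
            continue
        words.append(token)
        alignment.append((token, regions[pointer]))
    return words, alignment


# Made-up example: three regions to describe, in order.
regions = ["man_box", "frisbee_box", "dog_box"]
tokens = ["a", "man", NEXT, "throws", "a", "frisbee", NEXT, "to", "a", "dog"]
words, alignment = align_tokens_to_regions(tokens, regions)
caption = " ".join(words)
```

Because the NEXT-token lives in the same vocabulary as ordinary words, the timing decision needs no separate gating module in this sketch: the language model emits an advancement the same way it emits any other token.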

Code Implementations: 1 repo