CV · Apr 28, 2022

Controllable Image Captioning

arXiv:2204.13324v4

Originality: Incremental advance

AI Analysis

The paper addresses the need for interpretable, customizable image captioning that serves different users, though it is an incremental improvement over existing methods.

The paper tackles diverse, controllable image caption generation with a framework that uses Part-Of-Speech tags as control signals; the authors report that it significantly outperforms state-of-the-art methods on public datasets.

State-of-the-art image captioners can generate accurate sentences to describe images in a sequence-to-sequence manner, but they do so without considering controllability or interpretability. This falls short of making image captioning widely usable, since an image can be interpreted in infinitely many ways depending on the target and the context at hand. Achieving controllability is especially important when the image captioner is used by different people with different ways of interpreting the images. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics. Our model decouples direct dependence between successive variables. In this way, it allows the decoder to exhaustively search through the latent Part-Of-Speech choices, while keeping decoding speed proportional to the size of the POS vocabulary. Given a control signal in the form of a sequence of Part-Of-Speech tags, we propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods at generating diverse, high-quality image captions.
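
The idea of steering a caption with a Part-Of-Speech tag sequence can be illustrated with a toy sketch. This is not the paper's model: a lookup table stands in for the Transformer decoder, and the lexicon, the per-word relevance scores (a stand-in for image features), and the function name `caption_from_pos` are all hypothetical, chosen only to show how the same image can yield different captions under different POS control signals.

```python
from typing import Dict, List

# Hypothetical tag-to-word lexicon (an assumption, not taken from the paper).
LEXICON: Dict[str, List[str]] = {
    "DET": ["a", "the"],
    "ADJ": ["brown", "small", "green"],
    "NOUN": ["dog", "ball", "grass"],
    "VERB": ["chases", "sits"],
    "ADP": ["on", "near"],
}

CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}


def caption_from_pos(pos_tags: List[str], word_scores: Dict[str, float]) -> str:
    """For each POS tag, pick the highest-scoring word allowed by that tag."""
    used = set()
    words = []
    for tag in pos_tags:
        # Prefer words not used yet; fall back to the full list if all are used.
        candidates = [w for w in LEXICON[tag] if w not in used] or LEXICON[tag]
        best = max(candidates, key=lambda w: word_scores.get(w, 0.0))
        words.append(best)
        # Content words are not repeated; function words may be.
        if tag in CONTENT_TAGS:
            used.add(best)
    return " ".join(words)


# Hypothetical relevance scores standing in for what an image encoder would supply.
scores = {
    "the": 0.9, "a": 0.5,
    "brown": 0.8, "green": 0.6, "small": 0.4,
    "dog": 0.9, "ball": 0.7, "grass": 0.5,
    "chases": 0.8, "sits": 0.3,
    "on": 0.6, "near": 0.2,
}

# Two control signals for the same "image" give two different captions.
short = caption_from_pos(["DET", "NOUN", "VERB", "DET", "NOUN"], scores)
long = caption_from_pos(
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN"], scores
)
print(short)
print(long)
```

In a learned system, the per-tag lookup would be replaced by a decoder that predicts each word from the image and the tag sequence jointly; the sketch only shows the control interface, where the tag sequence fixes the syntactic shape and the image determines the words.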
