CV | Nov 27, 2022

CLID: Controlled-Length Image Descriptions with Limited Data

arXiv:2211.14835v2 | 6 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of long-caption training data for image captioning models, enabling more flexible and detailed descriptions. Its contribution is an incremental improvement over existing controllable captioning methods.

The paper tackles the challenge of generating image captions with controlled length, particularly long captions, by enriching datasets with self-generated captions of varying lengths and introducing a novel training strategy that selects data points at different training stages. The method significantly improves length-control abilities and achieves state-of-the-art performance in caption quality, with applicability extended to paragraph generation.

Controllable image captioning models generate human-like image descriptions, enabling some kind of control over the generated captions. This paper focuses on controlling the caption length, i.e. a short and concise description or a long and detailed one. Since existing image captioning datasets contain mostly short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions. These, however, might be of varying quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects the data points to be used at different times during the training. Our method dramatically improves the length-control abilities, while exhibiting SoTA performance in terms of caption quality. Our approach is general and is shown to be applicable also to paragraph generation.
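The abstract does not spell out how captions are assigned a length level or how data points are selected over the course of training, so the following is only a minimal sketch under stated assumptions, not the paper's procedure. It assumes word-count buckets with made-up edges, a hypothetical per-example quality score in [0, 1], and a threshold that relaxes linearly across epochs; the names `Example`, `length_bucket`, and `select_for_epoch` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Example:
    caption: str
    self_generated: bool  # True if produced by a captioning model (assumption)
    quality: float        # hypothetical quality score in [0, 1]


def length_bucket(caption: str, edges=(10, 20, 30)) -> int:
    """Map a caption to a length level by word count.

    The bucket edges are assumed values, not taken from the paper.
    """
    n_words = len(caption.split())
    for level, edge in enumerate(edges):
        if n_words < edge:
            return level
    return len(edges)


def select_for_epoch(examples, epoch, total_epochs, start=0.9, end=0.5):
    """Choose the training subset for one epoch.

    Human-written captions are always kept. Self-generated captions are
    admitted only if their quality score meets a threshold that moves
    linearly from `start` to `end` over training (an assumed schedule).
    """
    frac = epoch / max(total_epochs - 1, 1)
    threshold = start + (end - start) * frac
    return [
        ex for ex in examples
        if not ex.self_generated or ex.quality >= threshold
    ]
```

With such a schedule, early epochs would see mostly human captions plus the highest-scoring self-generated ones, and later epochs would admit more of the long self-generated captions. Whether the threshold should relax or tighten over time is a design choice this sketch does not settle.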

Code Implementations: 1 repo