CL Jun 23, 2023

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

arXiv:2306.13460v32.97 citations h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating detailed and varied captions for images, which matters for applications such as accessibility and content indexing. The advance is incremental because it modifies an existing training objective rather than proposing a new one.

The paper tackles the tendency of image captioning models to generate overly general descriptions, which it attributes to conflicting optimization directions in maximum likelihood estimation. It introduces Semipermeable Maximum Likelihood Estimation (SMILE) to encourage richer captions and reports significantly more descriptive captions on the MSCOCO and Flickr30K datasets.

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
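The contrast between the two optimization directions can be illustrated with a toy single-step example. This is a minimal sketch, not the authors' implementation: the abstract does not say how SMILE blocks conciseness optimization, so the restricted-vocabulary softmax in `subset_loss`, the `allowed` set, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a {word: logit} dict."""
    m = max(logits.values())
    exps = {w: math.exp(v - m) for w, v in logits.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def mle_loss(logits, label):
    """Standard MLE: negative log-probability of the label over the full vocabulary.
    Any probability mass on a non-label word, richer or not, raises the loss."""
    return -math.log(softmax(logits)[label])

def subset_loss(logits, label, allowed):
    """Hypothetical semipermeable variant (an assumption, not confirmed by the abstract):
    the softmax is taken only over `allowed`, so words outside it are never penalized."""
    restricted = {w: v for w, v in logits.items() if w in allowed}
    return -math.log(softmax(restricted)[label])

# Toy setup: the label caption is "a dog on grass" and the current label word is "dog".
# The model prefers "fluffy", a richer continuation ("a fluffy dog ...").
logits = {"dog": 2.0, "fluffy": 3.0, "a": 0.5, "on": 0.0}
label = "dog"
# Assumed subset: the label word plus a later caption word. Predicting "on" here
# would skip "dog" (a more concise caption), so it remains penalized.
allowed = {"dog", "on"}

full = mle_loss(logits, label)
semi = subset_loss(logits, label, allowed)

# Raising the logit of the richer word changes the full MLE loss but not the subset loss.
richer = {**logits, "fluffy": 10.0}
full_richer = mle_loss(richer, label)
semi_richer = subset_loss(richer, label, allowed)
```

In this sketch, the full-vocabulary loss grows as the model commits to the richer word "fluffy", which corresponds to the conciseness optimization the abstract describes. The restricted loss ignores "fluffy" but still penalizes mass placed on "on", so only the richness direction remains.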
