Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition
This method aims to generate more human-like and customizable image descriptions for vision-language applications. The contribution is incremental, since it builds on existing coarse-to-fine approaches.
The paper tackles image captioning by decomposing each description into a skeleton sentence and its attributes, then generating the two separately to improve accuracy and novelty. Experimental results on the MS-COCO and Stock3M datasets show consistent improvements across metrics, especially SPICE, which correlates better with human ratings than conventional metrics do.
Recently, there has been a lot of interest in automatically generating descriptions for an image. Most existing language-model-based approaches for this task learn to generate an image description word by word in its original word order. However, for humans, it is more natural to locate the objects and their relationships first, and then elaborate on each object, describing notable attributes. We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, and generates the skeleton sentence and attribute phrases separately. With this decomposition, our method can generate more accurate and novel descriptions than the previous state of the art. Experimental results on MS-COCO and the larger-scale Stock3M dataset show that our algorithm yields consistent improvements across different evaluation metrics, especially on the SPICE metric, which has much higher correlation with human ratings than the conventional metrics. Furthermore, our algorithm can generate descriptions of varied length, benefiting from the separate control of the skeleton and attributes. This enables image description generation that better accommodates user preferences.
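The skeleton-attribute decomposition and its effect on description length can be illustrated with a toy sketch. This is not the paper's model, which generates both parts with learned networks; it only shows, on a hand-written caption, how keeping attributes separate from the skeleton allows the final length to be controlled. The names `SkeletonWord`, `compose`, and `max_attributes` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkeletonWord:
    """One word of the skeleton sentence, with optional attribute words.

    Hypothetical structure for illustration; attributes are placed
    directly before the skeleton word they describe.
    """
    word: str
    attributes: List[str] = field(default_factory=list)


def compose(skeleton: List[SkeletonWord], max_attributes: int) -> str:
    """Merge skeleton and attributes into one caption.

    At most `max_attributes` attribute words are kept per skeleton word,
    so a larger value yields a longer, more detailed caption.
    """
    limit = max(max_attributes, 0)  # guard against negative slicing
    tokens: List[str] = []
    for item in skeleton:
        tokens.extend(item.attributes[:limit])
        tokens.append(item.word)
    return " ".join(tokens)


# Hand-written decomposition of a single caption (assumed example data).
caption = [
    SkeletonWord("girl", ["young"]),
    SkeletonWord("riding"),
    SkeletonWord("horse", ["large", "brown"]),
    SkeletonWord("on"),
    SkeletonWord("beach", ["sandy"]),
]

print(compose(caption, 0))  # skeleton only
print(compose(caption, 1))  # one attribute per object
print(compose(caption, 2))  # up to two attributes per object
```

With `max_attributes=0` only the skeleton is emitted, and raising the limit adds attribute words to each object, which mirrors in a very simplified way the separate control of skeleton and attributes described above.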