Aesthetic Image Captioning From Weakly-Labelled Photographs
This work addresses the problem of generating critical textual feedback for photographs. The contribution is incremental: it builds on existing natural image captioning methods and adapts them to the aesthetic domain with new data and training strategies.
The paper tackles the lack of a large-scale, clean dataset for aesthetic image captioning by proposing an automatic cleaning strategy to create AVA-Captions, a dataset of 230,000 images with 5 captions each, and introduces a weakly supervised method for training the visual feature extractor used in this task.
Aesthetic image captioning (AIC) refers to the multi-modal task of generating critical textual feedback for photographs. While in natural image captioning (NIC) deep models are trained in an end-to-end manner using large curated datasets such as MS-COCO, no such large-scale, clean dataset exists for AIC. Towards this goal, we propose an automatic cleaning strategy to create a benchmarking AIC dataset by exploiting the images and noisy comments easily available from photography websites. We propose a probabilistic caption-filtering method for cleaning the noisy web data, and compile a large-scale, clean dataset, "AVA-Captions" (230,000 images with 5 captions per image). Additionally, by exploiting the latent associations between aesthetic attributes, we propose a strategy for training the convolutional neural network (CNN) based visual feature extractor, the first component of the AIC framework. The strategy is weakly supervised and can be used to learn rich aesthetic representations without requiring expensive ground-truth annotations. Finally, we present a thorough analysis of the proposed contributions using automatic metrics and subjective evaluations.
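The abstract does not spell out how the probabilistic caption filter scores a comment, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a comment is more informative when its words are rarer in the corpus, scores each comment by its mean negative log unigram probability, and keeps comments at or above a threshold. The function names, the whitespace tokenizer, the per-token averaging, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def unigram_probs(comments):
    # Estimate the probability of each word from the raw comment corpus.
    counts = Counter(tok for c in comments for tok in tokenize(c))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def informativeness(comment, probs):
    # Mean negative log-probability of the comment's words.
    # Generic comments ("nice shot") use frequent words and score low.
    tokens = tokenize(comment)
    if not tokens:
        return 0.0
    return -sum(math.log(probs[t]) for t in tokens) / len(tokens)


def filter_comments(comments, threshold):
    # Keep only comments whose score reaches the (hypothetical) threshold.
    probs = unigram_probs(comments)
    return [c for c in comments if informativeness(c, probs) >= threshold]


if __name__ == "__main__":
    noisy = [
        "nice shot",
        "nice shot",
        "nice",
        "great use of leading lines and soft light",
    ]
    print(filter_comments(noisy, threshold=2.0))
```

In this toy corpus the short, repeated comments fall below the threshold and only the descriptive comment survives. A real pipeline would need a proper tokenizer, a much larger corpus, and a threshold tuned on held-out data.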