CV · Mar 26, 2019

Unpaired Image Captioning via Scene Graph Alignments

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, Gang Wang

arXiv:1903.10658v425.2191 citations

Originality: Highly original

AI Analysis

This paper addresses the labor-intensive challenge of collecting paired image-caption datasets for training image captioning models.

The paper tackles the problem of generating image captions without paired training data by proposing a scene graph-based approach with unsupervised feature alignment, achieving results that outperform existing methods by a wide margin.

Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale paired image-caption data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
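The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `UnpairedCaptioner`, the triple-based scene graph representation, and all the toy stub functions are assumptions made for the example. It also assumes that, at inference time, the text-trained encoder is applied to the image scene graph before the alignment mapping. The sentence scene graph generator is used only when training the encoder and decoder on text, so it does not appear here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed representation: a scene graph as (subject, relation, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = List[Triple]
Features = List[float]


@dataclass
class UnpairedCaptioner:
    """Hypothetical wiring of the components named in the abstract."""
    image_graph_generator: Callable[[object], SceneGraph]
    graph_encoder: Callable[[SceneGraph], Features]              # trained on text only
    image_to_sentence_mapper: Callable[[Features], Features]     # feature alignment
    sentence_decoder: Callable[[Features], str]                  # trained on text only

    def caption(self, image: object) -> str:
        graph = self.image_graph_generator(image)        # image -> scene graph
        image_feats = self.graph_encoder(graph)          # scene graph -> features
        aligned = self.image_to_sentence_mapper(image_feats)  # image -> sentence space
        return self.sentence_decoder(aligned)            # features -> caption


# Toy stand-ins so the sketch runs; real components would be learned models.
def toy_graph_generator(_image: object) -> SceneGraph:
    return [("man", "riding", "horse")]


def toy_encoder(graph: SceneGraph) -> Features:
    # One feature per token: its character length.
    return [float(len(token)) for triple in graph for token in triple]


def toy_mapper(feats: Features) -> Features:
    return list(feats)  # identity mapping as a placeholder for the learned alignment


def toy_decoder(feats: Features) -> str:
    return f"caption from {len(feats)} features"


demo = UnpairedCaptioner(toy_graph_generator, toy_encoder, toy_mapper, toy_decoder)
print(demo.caption(None))
```

Keeping the mapper as a separate stage mirrors the abstract's point that the encoder and decoder are trained on text alone, with only the alignment step bridging the two modalities.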

