CV · Oct 28, 2020

Leveraging Visual Question Answering to Improve Text-to-Image Synthesis

arXiv:2010.14953v1990 citations
Originality: Incremental advance
AI Analysis

This paper addresses a specific bottleneck in text-to-image synthesis, the generation of complex multi-object scenes, and represents an incremental improvement.

The paper tackles the problem of generating images with multiple objects from text by combining Text-to-Image synthesis with Visual Question Answering, resulting in improved image quality and alignment with text (FID lowered from 27.84 to 25.38 and R-prec. increased from 83.82% to 84.79%).
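To put the reported numbers in proportion, the short calculation below works out the relative FID reduction and the absolute R-prec. gain from the four figures quoted above. The helper name `relative_change` is illustrative only and is not taken from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before


# Figures quoted in the summary above.
fid_before, fid_after = 27.84, 25.38        # lower FID is better
rprec_before, rprec_after = 83.82, 84.79    # higher R-prec. is better (percent)

fid_rel = relative_change(fid_before, fid_after)
rprec_gain = rprec_after - rprec_before     # difference in percentage points

print(f"FID relative change: {fid_rel:.1%}")
print(f"R-prec. gain: {rprec_gain:.2f} percentage points")
```

The FID drops by 2.46 points, roughly 8.8% relative to the baseline, while R-prec. rises by 0.97 percentage points, which is consistent with the "incremental" rating.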

Generating images from textual descriptions has recently attracted a lot of interest. While current models can generate photo-realistic images of individual objects such as birds and human faces, synthesising images with multiple objects is still very difficult. In this paper, we propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment of generated images by leveraging the VQA 2.0 dataset. We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal. We encourage images generated from QA pairs to look realistic and additionally minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline, which indicates that T2I synthesis can successfully be improved using a standard VQA model.
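The two ingredients described in the abstract, QA pairs concatenated into extra training captions and an external VQA loss added to the generator objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `QAPair`, `qa_to_caption`, `total_generator_loss`, and the `vqa_weight` parameter are hypothetical, and the exact concatenation format and loss weighting are not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One question/answer pair, e.g. drawn from a VQA-style dataset."""
    question: str
    answer: str


def qa_to_caption(pair: QAPair) -> str:
    # Assumption: the pair is joined with a single space; the paper only says
    # that question and answer are concatenated.
    return f"{pair.question.strip()} {pair.answer.strip()}"


def total_generator_loss(adversarial_loss: float, vqa_loss: float,
                         vqa_weight: float = 1.0) -> float:
    # Assumption: the VQA loss enters as a weighted additive term;
    # `vqa_weight` is a hypothetical hyperparameter.
    return adversarial_loss + vqa_weight * vqa_loss


pairs = [
    QAPair("What color is the bus?", "red"),
    QAPair("How many dogs are there?", "2"),
]
captions = [qa_to_caption(p) for p in pairs]
print(captions)
print(total_generator_loss(adversarial_loss=2.0, vqa_loss=0.5, vqa_weight=2.0))
```

In a real training loop the two loss values would be tensors produced by the discriminator and by a VQA model run on the generated image; plain floats are used here only to keep the sketch self-contained.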

