LG · CL · CV · ML · Sep 27, 2018

Semantically Invariant Text-to-Image Generation

arXiv:1809.10274v1 · 10 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of integrating visual and text modalities for AI applications, representing an incremental advancement in multimodal generation.

The paper tackles the problem of bidirectional generation between images and text by introducing the Multi-Modal Vector Representation (MMVR) architecture, which improves text-conditioned image generation by over 20% through a novel n-gram metric cost function and the use of multiple semantically similar sentences.

Image captioning has demonstrated models that are capable of generating plausible text given input images or videos. Further, recent work in image generation has shown significant improvements in image quality when text is used as a prior. Our work ties these concepts together by creating an architecture that can enable bidirectional generation of images and text. We call this network Multi-Modal Vector Representation (MMVR). Along with MMVR, we propose two improvements to text-conditioned image generation. First, an n-gram metric-based cost function is introduced that generalizes the caption with respect to the image. Second, multiple semantically similar sentences are shown to help in generating better images. Qualitative and quantitative evaluations demonstrate that MMVR improves upon existing text-conditioned image generation results by over 20%, while integrating visual and text modalities.
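The abstract does not say which n-gram metric the cost function is built on, nor how it enters training. As an illustration only, the sketch below assumes a BLEU-style clipped n-gram precision scored against several reference captions, which also shows why multiple semantically similar sentences can help: each extra reference widens the set of n-grams a generated caption may match. The function names `ngrams`, `clipped_ngram_precision`, and `ngram_cost` are hypothetical and are not taken from the paper. This is a plain score, not a differentiable training loss.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, references, n=2):
    """BLEU-style clipped n-gram precision of a candidate against references.

    Assumption: whitespace tokenisation and lowercasing; the paper's actual
    metric and preprocessing are not given in the abstract.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate n-gram count by its best reference count.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


def ngram_cost(candidate, references, n=2):
    """Hypothetical cost: lower when the caption shares more n-grams with the references."""
    return 1.0 - clipped_ngram_precision(candidate, references, n)


if __name__ == "__main__":
    caption = "a bird sits on a branch"
    one_ref = ["a bird sits on a tree"]
    two_refs = one_ref + ["a small bird on a branch"]
    print(ngram_cost(caption, one_ref))
    print(ngram_cost(caption, two_refs))
```

In this toy case, adding the second reference covers the bigram "a branch" that the first reference lacks, so the cost of the same caption drops.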

