LG | ML | Apr 23, 2020

Efficient Neural Architecture for Text-to-Image Synthesis

arXiv:2004.11437v1 | 25 citations
Originality: Highly original
AI Analysis

For researchers in generative AI, this work addresses the complexity of combining text and image modalities and offers a more streamlined approach.

The paper tackles the problem of text-to-image synthesis by proposing an efficient neural architecture that achieves state-of-the-art performance using single-stage training with a single generator and discriminator, eliminating the need for multi-stage strategies.

Text-to-image synthesis is the task of generating images from text descriptions. Image generation, by itself, is a challenging task. Combining image generation with text raises the complexity further, because data from two different modalities must be combined. Most recent work in text-to-image synthesis follows a similar approach to neural architecture. Due to the aforementioned difficulties, plus the inherent difficulty of training GANs at high resolutions, most methods have adopted a multi-stage training strategy. In this paper we shift the architectural paradigm currently used in text-to-image methods and show that an effective neural architecture can achieve state-of-the-art performance using single-stage training with a single generator and a single discriminator. We do so by applying deep residual networks along with a novel sentence interpolation strategy that enables learning a smooth conditional space. Finally, our work points to a new direction for text-to-image research, which has not experimented with novel neural architectures recently.
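The abstract does not spell out how the sentence interpolation strategy works. As a rough illustration only, the sketch below assumes it amounts to a convex combination of two sentence embeddings describing the same image, with a randomly drawn mixing weight. The function names `interpolate_sentences` and `sample_conditioning`, the list-of-floats embedding format, and the uniform choice of `alpha` are all hypothetical and are not taken from the paper.

```python
import random


def interpolate_sentences(emb_a, emb_b, alpha):
    """Return the convex combination (1 - alpha) * emb_a + alpha * emb_b.

    Assumption: embeddings are plain lists of floats of equal length.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


def sample_conditioning(caption_embeddings, rng=random):
    """Blend two distinct caption embeddings of one image with a random weight.

    Assumption: each image has at least two caption embeddings, and the
    mixing weight is drawn uniformly from [0, 1).
    """
    emb_a, emb_b = rng.sample(caption_embeddings, 2)
    alpha = rng.random()
    return interpolate_sentences(emb_a, emb_b, alpha)
```

Under these assumptions, a generator conditioned on the output of `sample_conditioning` would see points between real caption embeddings during training, which is one plausible way to encourage a smooth conditional space.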

Code Implementations: 1 repo