Remember What You have drawn: Semantic Image Manipulation with Memory
The paper addresses the challenging task of semantic image manipulation with natural language, which is relevant to both computer vision and NLP, but it appears incremental because it builds on prior efforts.
The paper tackles the problem of generating realistic, text-conformed manipulated images. It proposes a memory-based network (MIM-Net) that combines a two-stage approach, target localization, and randomized memory training, and it reports better performance than existing methods on four popular datasets.
Image manipulation with natural language, which aims to manipulate images under the guidance of language descriptions, is a challenging problem in computer vision and natural language processing (NLP). A number of efforts have been made on this task, but their results are still far from realistic, text-conformed manipulated images. In this paper, we therefore propose a memory-based Image Manipulation Network (MIM-Net), in which a set of memories learned from images is used to synthesize texture information under the guidance of the textual description. We propose a two-stage network with an additional reconstruction stage to learn the latent memories efficiently. To avoid unnecessary background changes, we propose a Target Localization Unit (TLU) that focuses the manipulation on the region mentioned by the text. Moreover, to learn a robust memory, we propose a novel randomized memory training loss. Experiments on four popular datasets show that our method outperforms existing ones.
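
The abstract does not say how the memories are read or how the TLU restricts edits to the target region, so the following is only a minimal sketch under stated assumptions, not the MIM-Net implementation. It assumes a soft-attention read over a small bank of memory slots keyed by a text embedding, and a binary mask that blends edited pixels into the original image. The names read_memory and blend_with_mask, the slot count, and the embedding size are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def read_memory(query, memory):
    """Hypothetical soft-attention read.

    query:  (d,) text embedding
    memory: (K, d) bank of K memory slots
    Returns the attention-weighted mix of slots and the weights.
    """
    scores = memory @ query / np.sqrt(memory.shape[1])
    weights = softmax(scores)
    return weights @ memory, weights

def blend_with_mask(original, edited, mask):
    """Keep original pixels where mask is 0 and edited pixels where mask is 1."""
    return mask * edited + (1.0 - mask) * original

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))    # assumption: 8 slots of dimension 16
text_query = rng.normal(size=16)     # assumption: stand-in for a text embedding
texture, weights = read_memory(text_query, memory)

original = np.zeros((4, 4, 3))       # toy "image" of zeros
edited = np.ones((4, 4, 3))          # toy fully edited image of ones
mask = np.zeros((4, 4, 1))
mask[1:3, 1:3] = 1.0                 # assumption: localized target region
result = blend_with_mask(original, edited, mask)
```

In this toy setup only the 2x2 masked region of result takes the edited values, which illustrates the stated goal of leaving the background unchanged. How the actual TLU predicts its region and how the randomized memory training loss is defined are not described in the abstract and are not modeled here.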