CV, Jun 4, 2024

Enhance Image-to-Image Generation with LLaVA-generated Prompts

arXiv:2406.01956v3, 27 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating more faithful and coherent images in image-to-image tasks, though it is incremental as it builds on existing multimodal and generation methods.

This paper improves image-to-image generation by using LLaVA to generate textual prompts from the input images. The authors report enhanced visual coherence and a stronger resemblance between generated and input images.

This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs.
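The data flow described in the abstract (describe the input image, then pass both the description and the original image to the generator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `describe` and `img2img` are hypothetical callables standing in for a LLaVA captioning step and an image-to-image generation pipeline, and `generate_with_described_prompt`, `GenerationResult`, and `fallback_prompt` are names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   DescribeFn: takes an image, returns a textual description of it.
#   Img2ImgFn:  takes the original image and a prompt, returns a new image.
DescribeFn = Callable[[Any], str]
Img2ImgFn = Callable[[Any, str], Any]


@dataclass
class GenerationResult:
    prompt: str  # the description that was used as the prompt
    image: Any   # the generated output image


def generate_with_described_prompt(
    image: Any,
    describe: DescribeFn,
    img2img: Img2ImgFn,
    fallback_prompt: str = "",
) -> GenerationResult:
    """Describe the input image, then condition generation on image + description."""
    # Step 1: obtain a textual description of the input image.
    prompt = describe(image).strip()

    # If the description comes back empty, use the caller's fallback prompt
    # (an illustrative safeguard, not something the abstract mentions).
    if not prompt:
        prompt = fallback_prompt

    # Step 2: feed the original image together with the description
    # into the image-to-image generator.
    output = img2img(image, prompt)
    return GenerationResult(prompt=prompt, image=output)
```

In practice, `describe` would wrap a call to a vision-language model and `img2img` a call to a generation pipeline. Keeping both as injected callables lets the flow be exercised with stubs before any model is loaded.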

