CV · AI · CL · Jan 3, 2024

Instruct-Imagen: Image Generation with Multi-modal Instruction

DeepMind
arXiv:2401.01952v1 · 89 citations · h-index: 29 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the need for more flexible and generalizable image generation models in AI and creative applications, though the advance is incremental because it builds on pre-trained diffusion models.

The paper tackles the problem of heterogeneous image generation tasks by introducing a multi-modal instruction framework that standardizes generation intents, and results show that instruct-imagen matches or surpasses prior task-specific models in-domain while generalizing to unseen tasks.

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, and subject), so that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's capability to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
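
To make the idea of a multi-modal instruction more concrete, here is a minimal Python sketch of one way such a task representation could be modelled: a natural-language instruction whose markers point at a list of non-text context items. The class names, the `[ref#N]` marker syntax, and the file names are assumptions made for illustration only and are not taken from the paper's implementation.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextItem:
    """One piece of non-text conditioning (hypothetical structure)."""
    modality: str  # e.g. "edge", "style", "subject"
    payload: str   # stand-in for image data; here just a file name


@dataclass
class MultiModalInstruction:
    """Natural-language instruction plus the context items it refers to."""
    text: str
    context: List[ContextItem] = field(default_factory=list)

    def referenced_indices(self) -> List[int]:
        # The [ref#N] marker format is an assumed convention for this sketch.
        return [int(n) for n in re.findall(r"\[ref#(\d+)\]", self.text)]

    def validate(self) -> None:
        # Every marker must point at an existing context item (1-based).
        for idx in self.referenced_indices():
            if not 1 <= idx <= len(self.context):
                raise ValueError(f"[ref#{idx}] has no matching context item")

    def describe(self) -> str:
        # Replace each marker with the modality it points to, for inspection.
        self.validate()
        return re.sub(
            r"\[ref#(\d+)\]",
            lambda m: f"<{self.context[int(m.group(1)) - 1].modality}>",
            self.text,
        )


instruction = MultiModalInstruction(
    text="Generate a photo of a dog in the style of [ref#1], "
         "following the edges in [ref#2].",
    context=[ContextItem("style", "style.png"), ContextItem("edge", "edges.png")],
)
print(instruction.describe())
```

The point of the sketch is only that text, style, edge, and subject inputs can share one uniform format: a single instruction string plus an ordered list of context items. How the model actually encodes and consumes that context is not specified here.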

