CV · Mar 16, 2025

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

arXiv:2503.12652v2 · 17 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses the inefficiency of maintaining multiple specialized models for image generation by offering a unified solution for researchers and practitioners, though it is an incremental advance that builds on existing diffusion models.

The paper tackles the need for separate models for different image generation tasks by introducing UniVG, a generalist diffusion model that uses a single set of weights to handle tasks such as text-to-image generation, inpainting, and editing. The authors report that it can outperform some task-specific models on their respective benchmarks.

Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
