CV | Mar 8, 2024

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

arXiv:2403.05121v1 | 81 citations | h-index: 36 | Has Code | ECCV
AI Analysis

This work addresses efficiency and quality issues in text-to-image generation for AI applications, representing an incremental improvement through a novel cascaded approach.

The authors tackled the challenges of computational efficiency and detail refinement in text-to-image diffusion models by proposing CogView3, a cascaded framework using relay diffusion, which outperforms SDXL by 77.0% in human evaluations and reduces inference time by half.

Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.

Code Implementations: 1 repo