CV · Dec 7, 2023

Approximate Caching for Efficiently Serving Diffusion Models

Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, Shiv Saini

arXiv:2312.04429v18.47 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive problem of production-grade diffusion model serving for applications requiring efficient and cost-effective image generation.

The paper tackles the high computational cost and latency of serving diffusion models for text-to-image generation by introducing approximate caching, which reuses intermediate noise states from similar prompts to skip some of the iterative denoising steps. On two production workloads, this yields a 19.8% reduction in end-to-end latency and 19% dollar savings on average.
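The reuse of intermediate noise states can be illustrated with a minimal sketch. It is not the paper's implementation: the similarity thresholds in `SKIP_TABLE`, the cache layout, and the names `steps_to_skip` and `plan_generation` are all assumptions made for illustration. The idea shown is that a more similar cached prompt lets generation resume from a later denoising step, while a dissimilar prompt falls back to a full run.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical thresholds (not from the paper):
# (minimum similarity, number of denoising steps that may be skipped)
SKIP_TABLE = [(0.95, 25), (0.85, 15), (0.75, 5)]

def steps_to_skip(similarity):
    """Map a prompt similarity to a number of skippable steps."""
    for min_sim, k in SKIP_TABLE:
        if similarity >= min_sim:
            return k
    return 0

def plan_generation(query_emb, cache, total_steps=50):
    """Decide where to resume denoising for a new prompt.

    cache: list of (prompt_embedding, {step: noise_state}) entries, where
    noise_state is whatever intermediate state was saved at that step.
    Returns (skipped_steps, starting_state_or_None, remaining_steps).
    """
    best_sim, best_states = -1.0, None
    for emb, states in cache:
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_sim, best_states = sim, states

    k = steps_to_skip(best_sim) if best_states is not None else 0
    if k:
        # Use the latest stored state that does not exceed the allowed skip.
        usable = [step for step in best_states if step <= k]
        k = max(usable) if usable else 0
    state = best_states[k] if k else None
    return k, state, total_steps - k
```

With a cache holding states at steps 5, 15, and 25 for one prompt, a near-identical query resumes from step 25 and runs only the remaining 25 steps, whereas an unrelated query gets no cached state and runs all 50. A real system would pass the returned state to the diffusion model's denoising loop, which this sketch leaves out.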

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that reduces the number of iterative denoising steps needed to generate an image from a prompt by reusing intermediate noise states created during a prior image generation for a similar prompt. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment.
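The abstract names the LCBFU policy but does not define how entries are scored. One plausible reading, sketched below purely as an assumption, is to rank each cached entry by the denoising steps it saves multiplied by how often it is reused, and to evict the lowest-ranked entry when the cache is full. The class `LCBFUCacheSketch`, its fields, and the scoring formula are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: str
    skipped_steps: int  # denoising steps saved each time this entry is reused
    hits: int = 0       # number of times this entry has been reused

    def score(self):
        # Assumed benefit metric: compute saved per reuse times reuse count.
        return self.skipped_steps * self.hits

class LCBFUCacheSketch:
    """Toy cache that evicts the entry with the lowest assumed benefit score."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries = {}

    def record_hit(self, key):
        self.entries[key].hits += 1

    def insert(self, key, skipped_steps):
        """Add an entry; return the evicted key, or None if nothing was evicted."""
        if key in self.entries:
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            victim = min(self.entries.values(), key=lambda e: e.score())
            evicted = victim.key
            del self.entries[evicted]
        self.entries[key] = Entry(key, skipped_steps)
        return evicted
```

Under this scoring, an entry that saves 25 steps and is reused once (score 25) outlives an entry that saves 5 steps and is reused three times (score 15). A plain least-frequently-used policy would make the opposite choice.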
