DC · Apr 6

GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

arXiv:2604.04335 · 84.9
Predicted impact top 3% in DC · last 90 days · Originality: Incremental advance
AI Analysis

GENSERVE addresses a practical challenge for production platforms that must efficiently serve mixed text-to-image and text-to-video workloads. The advance is incremental, since it builds on existing diffusion model serving systems.

The paper tackles co-serving heterogeneous diffusion model workloads (text-to-image and text-to-video) on shared GPU clusters under latency SLOs. Its result is GENSERVE, a system that improves the SLO attainment rate by up to 44% over the strongest baseline.

Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.
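To make the step-boundary idea concrete, here is a minimal toy sketch of why preempting a long video job between denoising steps can rescue a short image request's SLO. It is not GENSERVE's scheduler: the `Job` fields, the `simulate` function, the single-GPU setting, the least-slack-first policy, and all numbers are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, replace


@dataclass
class Job:
    name: str
    arrival: float    # arrival time in seconds
    steps_left: int   # remaining denoising steps
    step_time: float  # seconds per step (assumed constant, hence predictable)
    deadline: float   # absolute SLO deadline in seconds


def slack(job: Job, now: float) -> float:
    """Time to spare if the job ran uninterrupted from now on."""
    return job.deadline - now - job.steps_left * job.step_time


def simulate(jobs: list[Job], step_preemption: bool) -> dict[str, float]:
    """Run jobs on one hypothetical GPU and return finish times by job name.

    With step_preemption=True the scheduler re-decides at every step
    boundary (least slack first). With False, a started job runs to
    completion before another job is chosen.
    """
    pending = [replace(j) for j in jobs]  # copy so inputs are not mutated
    now, finish, current = 0.0, {}, None
    while pending:
        ready = [j for j in pending if j.arrival <= now]
        if not ready:
            now = min(j.arrival for j in pending)  # idle until next arrival
            continue
        if step_preemption or current is None:
            current = min(ready, key=lambda j: slack(j, now))
        current.steps_left -= 1  # execute exactly one denoising step
        now += current.step_time
        if current.steps_left == 0:
            finish[current.name] = now
            pending.remove(current)
            current = None
    return finish


# Illustrative numbers only: a long T2V job and a short T2I job.
video = Job("video", arrival=0.0, steps_left=50, step_time=1.0, deadline=80.0)
image = Job("image", arrival=10.0, steps_left=20, step_time=0.1, deadline=13.0)

with_preempt = simulate([video, image], step_preemption=True)
without_preempt = simulate([video, image], step_preemption=False)
print(with_preempt, without_preempt)
```

In this toy setting the image request meets its deadline only when the video job yields at a step boundary, and the video job still finishes within its own, looser deadline. The abstract's other mechanisms (elastic sequence parallelism, dynamic batching, joint multi-GPU allocation) are not modeled here.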

