DCMar 6
StreamWise: Serving Multi-Modal Generation in Real-Time at ScaleHaoran Qiu, Gohar Irfan Chaudhry, Chaojie Zhang et al.
Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than the real-time) for less than \$25. StreamWise enables high-quality real-time streaming with a sub-second startup delay under $45.
DCMar 4
Benchmarking Compound AI Applications for Hardware-Software Co-DesignParamuth Samuthrsindh, Angel Cervantes, Varun Gohil et al.
Compound AI applications, composed from interactions between Large Language Models (LLMs), Machine Learning (ML) models, external tools and data sources are quickly becoming an integral workload in datacenters. Their diverse sub-components and use-cases present a large configuration-space across the deployment stack -- ranging from applications and serving software down to hardware -- each of which may influence the application performance, deployment cost, and/or resource consumption. Despite their rapid adoption, however, the systems community lacks a standardized benchmark for analyzing this complicated design-space and guiding in system design. In this work, we present our benchmarking suite used for cross-stack analysis of Compound AI applications. Using this, we derive key takeaways and design principles spanning several layers of the stack for hardware-software co-design to unlock higher resource-efficiency.
DCJan 28, 2025
Towards Resource-Efficient Compound AI SystemsGohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri et al.
Compound AI Systems, integrating multiple interacting components like models, retrievers, and external tools, have emerged as essential for addressing complex AI tasks. However, current implementations suffer from inefficient resource utilization due to tight coupling between application logic and execution details, a disconnect between orchestration and resource management layers, and the perceived exclusiveness between efficiency and quality. We propose a vision for resource-efficient Compound AI Systems through a declarative workflow programming model and an adaptive runtime system for dynamic scheduling and resource-aware decision-making. Decoupling application logic from low-level details exposes levers for the runtime to flexibly configure the execution environment and resources, without compromising on quality. Enabling collaboration between the workflow orchestration and cluster manager enables higher efficiency through better scheduling and resource management. We are building a prototype system, called Murakkab, to realize this vision. Our preliminary evaluation demonstrates speedups up to $\sim 3.4\times$ in workflow completion times while delivering $\sim 4.5\times$ higher energy efficiency, showing promise in optimizing resources and advancing AI system design.
MAAug 22, 2025
Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud PlatformsGohar Irfan Chaudhry, Esha Choukse, Haoran Qiu et al.
Agentic workflows commonly coordinate multiple models and tools with complex control logic. They are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today's frameworks. The key problem is that they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs). We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8$\times$, energy consumption by 3.7$\times$, and cost by 4.3$\times$ while maintaining SLOs.