DC · May 11

Accelerating Compound LLM Training Workloads with Maestro

Xiulong Yuan, Hongqing Chen, Jiaxuan Peng, Fan Zhou, Zhixiang Ruan, Zekun Wang, Bo Zheng, Rui Men, Haiquan Wang, Zhipeng Zhang, Langshi Chen, Man Yuan

arXiv:2605.10501 · 56.9

Predicted impact: top 1% in DC · last 90 days

Originality: Incremental advance

AI Analysis

For practitioners training complex multi-component LLMs, Maestro provides a practical framework that significantly improves GPU utilization and throughput over monolithic approaches.

Maestro addresses static and dynamic heterogeneity in compound LLM training workloads (e.g., knowledge distillation and multimodal LLM training) by restructuring them into section graphs whose sections are configured independently, and by using wavefront scheduling to dynamically reorder input samples. In production, it reduces GPU consumption by ~40% on key workloads.

Compound LLM training workloads, such as knowledge distillation and multimodal LLM (MLLM) training, are gaining prominence. These workloads typically comprise heterogeneous components that differ in parameter scale, execution mode (forward-only or full forward-backward), and sequence length. Moreover, component activation can be data-dependent: in MLLM training, modality-specific parts activate only when inputs contain the corresponding modalities, which produces dynamic computational paths and irregular runtime workloads. Conventional frameworks, designed for monolithic models, cannot handle this dual heterogeneity: static (across components) and dynamic (at runtime). By enforcing one-size-fits-all training configurations across components and ignoring input-induced variations, they suffer from suboptimal throughput and poor GPU utilization. In this paper, we introduce Maestro, a section-centric training framework that addresses both challenges. Maestro first restructures the workload into a coarse-grained section graph. Each section independently configures its parallelism strategy, micro-batch size, and data-parallel degree, enabling fine-grained, component-aware resource allocation to tackle static heterogeneity. To tackle runtime irregularity, Maestro introduces a wavefront scheduling algorithm that dynamically reorders input samples to orchestrate concurrent section execution while preserving cross-section data dependencies. This maximizes inter-section parallelism and minimizes stalls, boosting hardware utilization. Deployed in production for millions of GPU hours, Maestro reduces GPU consumption by ~40% on key workloads, including knowledge distillation and MLLM training, validating its real-world impact.
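To make the two ideas in the abstract concrete, here is a minimal Python sketch of a two-section graph with per-section configurations, plus a toy simulation showing how reordering samples can remove stalls. It is not Maestro's implementation or its wavefront algorithm. The names `SectionConfig`, `Section`, `makespan`, and `reorder_short_chains_first`, the one-sample-per-tick cost model, and the "shortest dependency chain first" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionConfig:
    # Hypothetical per-section knobs, mirroring those named in the abstract.
    parallelism: str        # e.g. "tp2", "tp8" (illustrative labels only)
    micro_batch_size: int
    data_parallel: int


@dataclass(frozen=True)
class Section:
    name: str
    config: SectionConfig
    deps: tuple = ()        # names of upstream sections


def makespan(order, sections, active):
    """Toy cost model: each section handles one sample per tick, in queue order.

    `sections` must be listed in topological order. `active[i]` is the set of
    section names that sample i activates. Returns the number of ticks used.
    """
    queues = {s.name: [i for i in order if s.name in active[i]] for s in sections}
    done = set()  # (sample, section) pairs completed in earlier ticks
    ticks = 0
    while any(queues.values()):
        ticks += 1
        finished = []
        for s in sections:
            queue = queues[s.name]
            if not queue:
                continue
            head = queue[0]
            # Ready only if every upstream section this sample uses has finished.
            if all((head, d) in done for d in s.deps if d in active[head]):
                finished.append((head, s.name))
                queue.pop(0)
        if not finished:
            raise RuntimeError("no progress: check section order and deps")
        done.update(finished)
    return ticks


def reorder_short_chains_first(order, active):
    """Assumed heuristic: run samples that activate fewer sections first."""
    return sorted(order, key=lambda i: len(active[i]))


# A small vision encoder and a large LLM, each with its own configuration.
vision = Section("vision", SectionConfig("tp2", micro_batch_size=8, data_parallel=4))
llm = Section("llm", SectionConfig("tp8", micro_batch_size=1, data_parallel=1),
              deps=("vision",))
sections = [vision, llm]

# Samples 0 and 1 contain images; samples 2 and 3 are text-only.
active = {0: {"vision", "llm"}, 1: {"vision", "llm"}, 2: {"llm"}, 3: {"llm"}}

arrival_order = [0, 1, 2, 3]
reordered = reorder_short_chains_first(arrival_order, active)

baseline_ticks = makespan(arrival_order, sections, active)
reordered_ticks = makespan(reordered, sections, active)
print(reordered, baseline_ticks, reordered_ticks)  # [2, 3, 0, 1] 5 4
```

In arrival order, the LLM section idles during the first tick while it waits for the vision encoder. After reordering, the text-only samples keep the LLM busy while the encoder works on the image samples, and every sample still passes through its upstream section first.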

