DC LG · Jan 16, 2025

The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

Frank Sifei Luan, Ron Yifeng Wang, Yile Gu, Ziming Mao, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow

arXiv:2501.12407v53.33 citations · h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses inefficiencies in distributed data processing for ML workloads, enabling better utilization of heterogeneous resources. It is an incremental advance because it builds on the existing batch and stream processing models.

The paper tackles the bottleneck of CPU-based data processing in ML by introducing the streaming batch model. Its implementation, Ray Data, improves throughput on heterogeneous batch inference pipelines by 2.5-12x compared to traditional batch and stream processing systems, and improves training throughput for multimodal models such as Stable Diffusion by 31% compared to single-node ML data loaders.

While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of batch and streaming that enables efficient and fault-tolerant heterogeneous execution. The key idea is to use partitions as the unit of execution to achieve elasticity, but to allow partitions to be dynamically created and streamed between heterogeneous operators for memory-efficient pipelining. We present Ray Data, a streaming batch system that improves throughput on heterogeneous batch inference pipelines by 2.5-12$\times$ compared to traditional batch and stream processing systems. By leveraging heterogeneous clusters, Ray Data improves training throughput for multimodal models such as Stable Diffusion by 31% compared to single-node ML data loaders.
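
The key idea in the abstract, partitions as the unit of execution that are created dynamically and streamed between operators, can be illustrated with a toy sketch. This is not Ray Data's implementation or API: the names `read_partitions`, `apply_operator`, `cpu_preprocess`, `gpu_infer`, and `run_pipeline` are hypothetical, the "GPU" stage is a plain Python function standing in for a model, and everything runs in a single process using generators.

```python
from typing import Callable, Iterable, Iterator, List

# Assumption for illustration: a partition is a small list of integer records.
Partition = List[int]


def read_partitions(records: Iterable[int], partition_size: int) -> Iterator[Partition]:
    """Create partitions on demand instead of materializing the whole input."""
    partition: Partition = []
    for record in records:
        partition.append(record)
        if len(partition) == partition_size:
            yield partition
            partition = []
    if partition:
        # Emit the final, possibly smaller, partition.
        yield partition


def apply_operator(
    partitions: Iterator[Partition],
    fn: Callable[[Partition], Partition],
) -> Iterator[Partition]:
    """Apply one operator to a stream of partitions, one partition at a time."""
    for partition in partitions:
        yield fn(partition)


def cpu_preprocess(partition: Partition) -> Partition:
    """Hypothetical stand-in for a CPU-bound preprocessing step."""
    return [x * 2 for x in partition]


def gpu_infer(partition: Partition) -> Partition:
    """Hypothetical stand-in for a GPU-bound inference step."""
    return [x + 1 for x in partition]


def run_pipeline(records: Iterable[int], partition_size: int) -> Iterator[Partition]:
    """Chain the operators so each partition flows through both stages lazily."""
    stream = read_partitions(records, partition_size)
    stream = apply_operator(stream, cpu_preprocess)
    stream = apply_operator(stream, gpu_infer)
    return stream


results = list(run_pipeline(range(5), partition_size=2))
print(results)  # [[1, 3], [5, 7], [9]]
```

Because the stages are chained lazily, only one partition is in flight at a time in this sketch, which is a simplified picture of memory-efficient pipelining. A real system would additionally schedule the stages on different CPU and GPU workers and could re-execute a single failed partition rather than the whole job.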

