4 Papers

82.9DCMay 20Code
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

Yi Pan, Yile Gu, Jinbin Luo et al.

Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose DynaFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. DynaFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that DynaFlow can integrate representative parallelism strategies into 6 state-of-the-art ML systems with minimal code changes, achieving up to a 1.29x throughput improvement. DynaFlow is publicly available at https://github.com/uw-syfi/DynaFlow.

89.2DCMay 21
LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training

Haoyuan Liu, Kairui Zhou, Shuyao Qi et al.

To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14$\times$-23$\times$ over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.

LGJan 29
SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

Jianchang Su, Yifan Zhang, Shengkai Lin et al.

Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.

84.5NIApr 30
Libra: Accelerating Socket I/O via Programmable Selective Data Copying

Kairui Zhou, Shengkai Lin, Wei Zhang et al.

Layer-7 (L7) proxies are critical to modern cloud-native systems, yet their performance is increasingly bottlenecked by copying entire payloads across the kernel-user boundary. Existing approaches reduce this overhead but typically sacrifice compatibility with unmodified POSIX applications, introduce new APIs, or require specialized environments. We show that, under conventional OS abstractions, fully eliminating kernel-user copies while preserving standard socket semantics for unmodified proxies is fundamentally impossible. This leads to a practical insight: in common L7 workloads, proxies inspect only small metadata (e.g., HTTP headers) for routing, while forwarding the bulk payload unchanged. Based on this insight, we present Libra, an OS-level selective-copy framework that copies only metadata to the user space and retains the bulk payload in the kernel for forwarding, reducing data movement without breaking compatibility. Libra uses eBPF to identify protocol-specific metadata boundaries and coordinate selective copy and payload reuse across receive and transmit paths, all without modifying the socket API. Implemented in Linux and evaluated with unmodified Nginx and HAProxy, Libra improves plaintext throughput by up to 4.2x and reduces P99 tail latency by over 90%. With hardware-offloaded kTLS, it boosts encrypted throughput by 2.0x and cuts tail latency by 65%.