Dennis Cai

DC
h-index9
5papers
33citations
Novelty55%
AI Score34

5 Papers

DCNov 10, 2023
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation

Yifei Xu, Yuning Chen, Xumiao Zhang et al.

Among the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhanced the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that take more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20 times speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.

DCDec 17, 2024
TrainMover: An Interruption-Resilient and Reliable ML Training Runtime

ChonLam Lao, Minlan Yu, Aditya Akella et al.

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99\% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.

DCJun 10, 2025
PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production

Yu Guan, Zhiyu Yin, Haoyu Chen et al.

Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system utilizing fine-grained profiling, to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations). It scales to LMT on modern GPU clusters. PerfTracker effectively summarizes runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10, 000) GPUs (product homepage https://help.aliyun.com/zh/pai/user-guide/perftracker-online-performance-analysis-diagnostic-tool). It has been used to diagnose a variety of difficult performance issues.

NIOct 29, 2024
Cora: Accelerating Stateful Network Applications with SmartNICs

Shaoke Xi, Jiaqi Gao, Mengqi Liu et al.

With the growing performance requirements on networked applications, there is a new trend of offloading stateful network applications to SmartNICs to improve performance and reduce the total cost of ownership. However, offloading stateful network applications is non-trivial due to state operation complexity, state resource consumption, and the complicated relationship between traffic and state. Naively partitioning the program by state or traffic can result in a suboptimal partition plan with higher CPU usage or even packet drops. In this paper, we propose Cora, a compiler and runtime that offloads stateful network applications to SmartNIC-accelerated hosts. Cora compiler introduces an accurate performance model for each SmartNIC and employs an efficient compiling algorithm to search the offloading plan. Cora runtime can monitor traffic dynamics and adapt to minimize CPU usage. Cora is built atop Netronome Agilio and BlueField 2 SmartNICs. Our evaluation shows that for the same throughput target, Cora can propose partition plans saving up to 94.0% CPU cores, 1.9 times more than baseline solutions. Under the same resource constraint, Cora can accelerate network functions by 44.9%-82.3%. Cora runtime can adapt to traffic changes and keep CPU usage low.

DCJun 7, 2024
Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

Jianbo Dong, Bin Luo, Jun Zhang et al.

The emergence of Large Language Models (LLMs) has necessitated the adoption of distributed training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, the efficiency of large-scale distributed training systems is often suboptimal due to the increased likelihood of hardware errors in high-end GPU products and the heightened risk of network traffic collisions. Moreover, any local hardware failure can disrupt training tasks, and the inability to swiftly identify faulty components leads to a significant waste of GPU resources. And, prolonged communication due to traffic collisions can substantially increase GPU waiting times. To address these challenges, we propose a communication-driven solution, namely the C4. The key insights of C4 are twofold. First, the load in distributed training exhibits homogeneous characteristics and is divided into iterations through periodic synchronization, therefore hardware anomalies would incur certain syndrome in collective communication. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving a limited number of long-lived flows, allows C4 to efficiently execute traffic planning, substantially reducing bandwidth competition among these flows. The C4 has been extensively deployed across real-world production systems in a hyperscale cloud provider, yielding a significant improvement in system efficiency, from 30% to 45%. This enhancement is attributed to a 30% reduction in error-induced overhead and a 15% reduction in communication costs.