Zhiling Lan

h-index30

8papers

114citations

Novelty53%

AI Score43

Ranked #55,556 of 194,257 authors (top 29%)#245 in DC (top 25%)

8 Papers

4.1LGNov 14, 2025

SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

Xin Wang, Pietro Lodi Rizzini, Sourav Medya et al.

The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present \ourmodel, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port level router data. \ourmodel outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.

1.2DCAug 15, 2025

Coordinated Power Management on Heterogeneous Systems

Zhong Zheng, Zhiling Lan, Xingfu Wu et al.

Performance prediction is essential for energy-efficient computing in heterogeneous computing systems that integrate CPUs and GPUs. However, traditional performance modeling methods often rely on exhaustive offline profiling, which becomes impractical due to the large setting space and the high cost of profiling large-scale applications. In this paper, we present OPEN, a framework consists of offline and online phases. The offline phase involves building a performance predictor and constructing an initial dense matrix. In the online phase, OPEN performs lightweight online profiling, and leverages the performance predictor with collaborative filtering to make performance prediction. We evaluate OPEN on multiple heterogeneous systems, including those equipped with A100 and A30 GPUs. Results show that OPEN achieves prediction accuracy up to 98.29\%. This demonstrates that OPEN effectively reduces profiling cost while maintaining high accuracy, making it practical for power-aware performance modeling in modern HPC environments. Overall, OPEN provides a lightweight solution for performance prediction under power constraints, enabling better runtime decisions in power-aware computing environments.

13.2SEJun 30, 2024Code

LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes

Matthew T. Dearing, Yiheng Tao, Xingfu Wu et al.

This paper addresses the problem of providing a novel approach to sourcing significant training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes in the ranges of millions to billions of codes. To tackle this problem, we propose an automated pipeline framework called LASSI, designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops where errors encountered during the compilation and execution of generated code are fed back to the LLM through guided prompting for debugging and refactoring. We highlight the bi-directional translation of existing GPU benchmarks between OpenMP target offload and CUDA to validate LASSI. The results of evaluating LASSI with different application codes across four LLMs demonstrate the effectiveness of LASSI for generating executable parallel codes, with 80% of OpenMP to CUDA translations and 85% of CUDA to OpenMP translations producing the expected output. We also observe approximately 78% of OpenMP to CUDA translations and 62% of CUDA to OpenMP translations execute within 10% of or at a faster runtime than the original benchmark code in the same language.

13.6AIMay 4, 2025

Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Matthew T. Dearing, Yiheng Tao, Xingfu Wu et al.

While large language models (LLMs) are increasingly used for generating parallel scientific codes, most efforts emphasize functional correctness, often overlooking performance, especially energy efficiency. We propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for automated screening of code solutions. We introduce energy-reduction@k, a novel metric that quantifies expected energy reduction when generating k code candidates and selecting the most energy-efficient, enabling systematic evaluation of multi-attempt generation strategies. Evaluating 20 HeCBench applications and two miniApps on NVIDIA A100 and AMD MI100 GPUs, a single run (k=1) with LASSI-EE delivers refactored parallel codes with an average 29% expected energy reduction at an 81% pass rate, representing a 2.8x improvement over vanilla LLM prompting. Multiple runs (k=3) achieve an average 48% expected energy reduction at a 97% pass rate. These results are consistent across devices, demonstrating LASSI-EE's effectiveness across diverse hardware architectures.

2.6LGMar 24, 2024

Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling

Boyang Li, Zhiling Lan, Michael E. Papka

In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant challenge arises from the lack of interpretability in deep neural networks (DNN), rendering them as black-box models to system managers. This lack of model interpretability hinders the practical deployment of DRL scheduling. In this work, we present a framework called IRL (Interpretable Reinforcement Learning) to address the issue of interpretability of DRL scheduling. The core idea is to interpret DNN (i.e., the DRL policy) as a decision tree by utilizing imitation learning. Unlike DNN, decision tree models are non-parametric and easily comprehensible to humans. To extract an effective and efficient decision tree, IRL incorporates the Dataset Aggregation (DAgger) algorithm and introduces the notion of critical state to prune the derived decision tree. Through trace-based experiments, we demonstrate that IRL is capable of converting a black-box DNN policy into an interpretable rulebased decision tree while maintaining comparable scheduling performance. Additionally, IRL can contribute to the setting of rewards in DRL scheduling.

4.3DCMay 16, 2021

DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

Yuping Fan, Zhiling Lan

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly diverse workloads make such manual process challenging, time-consuming, and error-prone. We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows the system administrators to quickly obtain customized scheduling policies.

8.6DCFeb 11, 2021Code

Deep Reinforcement Agent for Scheduling in HPC

Yuping Fan, Zhiling Lan, Taylor Childers et al.

Cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning. DRAS is built on a novel, hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once being provided a specific scheduling objective given by system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workload changes. The experiments with different production workloads demonstrate that DRAS outperforms the existing heuristic and optimization approaches by up to 45%.

3.3LGNov 12, 2020

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods

Xingfu Wu, Valerie Taylor, Zhiling Lan

In this paper, we use modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and ten machine learning methods to model and predict performance and power and compare their prediction error rates. We use a fault-tolerant linear algebra code and a fault-tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experiment results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. Based on the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential optimization efforts associated with the application characteristics and the target architectures, and we predict theoretical outcomes of the potential optimizations. When we compare the prediction accuracy using MuMMI with that using 10 machine learning methods, we observe that MuMMI not only results in more accurate prediction in both performance and power but also presents how performance counters impact the performance and power models. This provides some insights about how to fine-tune the applications and/or systems for energy efficiency.