Taekyung Heo

6papers

160citations

Novelty51%

AI Score49

Ranked #49,987 of 201,326 authors (top 25%)#220 in DC (top 19%)

6 Papers

DCMar 24, 2023Code

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

William Won, Taekyung Heo, Saeed Rashidi et al.

As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and give meaningful insights when designing and deploying distributed training platforms at scale.

DCNov 30, 2022

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo et al.

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations of variable compute, memory, and network resources. Our case studies demonstrate COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7x and highlight performance optimization opportunities of up to 1.4x when employing memory expansion as an optimization technique.

9.4LGMay 20

Training distribution determines the ceiling of drug-blind cancer sensitivity prediction

Taekyung Heo

Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.

69.7DCMay 11

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Andy Balogh, Bradford M. Beckmann et al.

The fast pace of artificial intelligence~(AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning~(ML) workload behavior in production AI systems and enables efficient software-hardware~(SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace~(ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.

LGMay 23, 2023Code

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Taekyung Heo, Louis Feng et al.

Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks especially once systems are fully designed and deployed. However, the pace of AI innovation demands a more agile methodology to benchmark creation and usage by simulators and emulators for future system co-design. We propose Chakra, an open graph schema for standardizing workload specification capturing key operations and dependencies, also known as Execution Trace (ET). In addition, we propose a complementary set of tools/capabilities to enable collection, generation, and adoption of Chakra ETs by a wide range of simulators, emulators, and benchmarks. For instance, we use generative AI models to learn latent statistical properties across thousands of Chakra ETs and use these models to synthesize Chakra ETs. These synthetic ETs can obfuscate key proprietary information and also target future what-if scenarios. As an example, we demonstrate an end-to-end proof-of-concept that converts PyTorch ETs to Chakra ETs and uses this to drive an open-source training system simulator (ASTRA-sim). Our end-goal is to build a vibrant industry-wide ecosystem of agile benchmarks and tools to drive future AI system co-design.

DCAug 25, 2021

Hardware-assisted Trusted Memory Disaggregation for Secure Far Memory

Taekyung Heo, Seunghyo Kang, Sanghyeon Lee et al.

Memory disaggregation provides efficient memory utilization across network-connected systems. It allows a node to use part of memory in remote nodes in the same cluster. Recent studies have improved RDMA-based memory disaggregation systems, supporting lower latency and higher bandwidth than the prior generation of disaggregated memory. However, the current disaggregated memory systems manage remote memory only at coarse granularity due to the limitation of the access validation mechanism of RDMA. In such systems, to support fine-grained remote page allocation, the trustworthiness of all participating systems needs to be assumed, and thus a security breach in a node can propagate to the entire cluster. From the security perspective, the memory-providing node must protect its memory from memory-requesting nodes. On the other hand, the memory-requesting node requires the confidentiality and integrity protection of its memory contents even if they are stored in remote nodes. To address the weak isolation support in the current system, this study proposes a novel hardware-assisted memory disaggregation system. Based on the security features of FPGA, the logic in each per-node FPGA board provides a secure memory disaggregation engine. With its own networks, a set of FPGA-based engines form a trusted memory disaggregation system, which is isolated from the privileged software of each participating node. The secure memory disaggregation system allows fine-grained memory management in memory-providing nodes, while the access validation is guaranteed with the hardware-hardened mechanism. In addition, the proposed system hides the memory access patterns observable from remote nodes, supporting obliviousness. Our evaluation with FPGA implementation shows that such fine-grained secure disaggregated memory is feasible with comparable performance to the latest software-based techniques.