Joongun Park

h-index4

4papers

9citations

Novelty57%

AI Score47

Ranked #31,842 of 194,257 authors (top 16%)#133 in DC (top 14%)

4 Papers

11.1DCMay 11Code

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Andy Balogh, Bradford M. Beckmann et al.

The fast pace of artificial intelligence~(AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning~(ML) workload behavior in production AI systems and enables efficient software-hardware~(SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace~(ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.

1.2DCNov 13, 2025Code

STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design

Changhai Man, Joongun Park, Hanjiang Wu et al.

Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator(STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic tensor graph

5.9ARApr 27, 2025

NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic AI

Hanchen Yang, Zishen Wan, Ritik Raj et al.

Neuro-Symbolic AI (NSAI) is an emerging paradigm that integrates neural networks with symbolic reasoning to enhance the transparency, reasoning capabilities, and data efficiency of AI systems. Recent NSAI systems have gained traction due to their exceptional performance in reasoning tasks and human-AI collaborative scenarios. Despite these algorithmic advancements, executing NSAI tasks on existing hardware (e.g., CPUs, GPUs, TPUs) remains challenging, due to their heterogeneous computing kernels, high memory intensity, and unique memory access patterns. Moreover, current NSAI algorithms exhibit significant variation in operation types and scales, making them incompatible with existing ML accelerators. These challenges highlight the need for a versatile and flexible acceleration framework tailored to NSAI workloads. In this paper, we propose NSFlow, an FPGA-based acceleration framework designed to achieve high efficiency, scalability, and versatility across NSAI systems. NSFlow features a design architecture generator that identifies workload data dependencies and creates optimized dataflow architectures, as well as a reconfigurable array with flexible compute units, re-organizable memory, and mixed-precision capabilities. Evaluating across NSAI workloads, NSFlow achieves 31x speedup over Jetson TX2, more than 2x over GPU, 8x speedup over TPU-like systolic array, and more than 3x over Xilinx DPU. NSFlow also demonstrates enhanced scalability, with only 4x runtime increase when symbolic workloads scale by 150x. To the best of our knowledge, NSFlow is the first framework to enable real-time generalizable NSAI algorithms acceleration, demonstrating a promising solution for next-generation cognitive systems.

3.8CRAug 26, 2021

Stockade: Hardware Hardening for Distributed Trusted Sandboxes

Joongun Park, Seunghyo Kang, Sanghyeon Lee et al.

The widening availability of hardware-based trusted execution environments (TEEs) has been accelerating the adaptation of new applications using TEEs. Recent studies showed that a cloud application consists of multiple distributed software modules provided by mutually distrustful parties. The applications use multiple TEEs (enclaves) communicating through software-encrypted memory channels. Such execution model requires bi-directional protection: protecting the rest of the system from the enclave module with sandboxing and protecting the enclave module from a third-part module and operating systems. However, the current TEE model, such as Intel SGX, cannot efficiently represent such distributed sandbox applications. To overcome the lack of hardware supports for sandboxed TEEs, this paper proposes an extended enclave model called Stockade, which supports distributed sandboxes hardened by hardware. Stockade proposes new three key techniques. First, it extends the hardware-based memory isolation in SGX to confine a user software module only within its enclave. Second, it proposes a trusted monitor enclave that filters and validates systems calls from enclaves. Finally, it allows hardware-protected memory sharing between a pair of enclaves for efficient protected communication without software-based encryption. Using an emulated SGX platform with the proposed extensions, this paper shows that distributed sandbox applications can be effectively supported with small changes of SGX hardware.