Yeonhong Park

LG
h-index10
8papers
522citations
Novelty61%
AI Score52

8 Papers

LGAug 19, 2022
Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching

Yeonhong Park, Sunhong Min, Jae W. Lee

Recently, Graph Neural Networks (GNNs) have been receiving a spotlight as a powerful tool that can effectively serve various inference tasks on graph structured data. As the size of real-world graphs continues to scale, the GNN training system faces a scalability challenge. Distributed training is a popular approach to address this challenge by scaling out CPU nodes. However, not much attention has been paid to disk-based GNN training, which can scale up the single-node system in a more cost-effective manner by leveraging high-performance storage devices like NVMe SSDs. We observe that the data movement between the main memory and the disk is the primary bottleneck in the SSD-based training system, and that the conventional GNN training pipeline is sub-optimal without taking this overhead into account. Thus, we propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model in compiler optimization, Ginex restructures the GNN training pipeline by separating sample and gather stages. This separation enables Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. According to our evaluation with four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x at maximum) than the SSD-extended PyTorch Geometric.

LGFeb 16, 2024Code
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Yeonhong Park, Jake Hyun, SangLyul Cho et al.

Recently, considerable efforts have been directed towards compressing Large Language Models (LLMs), which showcase groundbreaking capabilities across diverse applications but entail significant deployment costs due to their large sizes. Meanwhile, much less attention has been given to mitigating the costs associated with deploying multiple LLMs of varying sizes despite its practical significance. Thus, this paper introduces \emph{any-precision LLM}, extending the concept of any-precision DNN to LLMs. Addressing challenges in any-precision LLM, we propose a lightweight method for any-precision quantization of LLMs, leveraging a post-training quantization framework, and develop a specialized software engine for its efficient serving. As a result, our solution significantly reduces the high costs of deploying multiple, different-sized LLMs by overlaying LLMs quantized to varying bit-widths, such as 3, 4, ..., $n$ bits, into a memory footprint comparable to a single $n$-bit LLM. All the supported LLMs with varying bit-widths demonstrate state-of-the-art model quality and inference throughput, proving itself to be a compelling option for deployment of multiple, different-sized LLMs. Our code is open-sourced and available online.

LGMay 11, 2025Code
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Jinuk Kim, Marwa El Halabi, Wonpyo Park et al.

Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.

LGAug 8, 2025
DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee et al.

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.

LGDec 28, 2024
DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization

Yeonhong Park, Jake Hyun, Hojoon Kim et al.

Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose DecDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and latency reduction. DecDEC stores the residual matrix -- the difference between full-precision and quantized weights -- in CPU, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations -- this enables adaptation to the dynamic nature of activation distribution, thus maximizing the effectiveness of error compensation. We demonstrate the effectiveness of DecDEC by augmenting state-of-the-art quantization methods. For example, DecDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12 -- outperforming its 3.5-bit counterpart -- while adding less than 0.0003\% to GPU memory usage and incurring only a 1.7\% inference slowdown on NVIDIA RTX 4050 Mobile.

LGFeb 15
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

Omin Kwon, Yeonjae Kim, Doyeon Kim et al.

Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.

CRAug 15, 2021
Mithril: Cooperative Row Hammer Protection on Commodity DRAM Leveraging Managed Refresh

Michael Jaemin Kim, Jaehyun Park, Yeonhong Park et al.

Since its public introduction in the mid-2010s, the Row Hammer (RH) phenomenon has drawn significant attention from the research community due to its security implications. Although many RH-protection schemes have been proposed by processor vendors, DRAM manufacturers, and academia, they still have shortcomings. Solutions implemented in the memory controller (MC) incur increasingly higher costs due to their conservative design for the worst case in terms of the number of DRAM banks and RH threshold to support. Meanwhile, DRAM-side implementation either has a limited time margin for RH-protection measures or requires extensive modifications to the standard DRAM interface. Recently, a new command for RH-protection has been introduced in the DDR5/LPDDR5 standards, referred to as refresh management (RFM). RFM enables the separation of the tasks for RHprotection to both MC and DRAM by having the former generate an RFM command at a specific activation frequency and the latter take proper RH-protection measures within a given time window. Although promising, no existing study presents and analyzes RFM-based solutions for RH-protection. In this paper, we propose Mithril, the first RFM interfacecompatible, DRAM-MC cooperative RH-protection scheme providing deterministic protection guarantees. Mithril has minimal energy overheads for common use cases without adversarial memory access patterns. We also introduce Mithril+, an optional extension to provide minimal performance overheads at the expense of a tiny modification to the MC, while utilizing existing DRAM commands.

DCFeb 22, 2020
A$^3$: Accelerating Attention Mechanisms in Neural Networks with Approximation

Tae Jun Ham, Sung Jun Jung, Seonghak Kim et al.

With the increasing computational demands of neural networks, many hardware accelerators for the neural networks have been proposed. Such existing neural network accelerators often focus on popular neural network types such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs); however, not much attention has been paid to attention mechanisms, an emerging neural network primitive that enables neural networks to retrieve most relevant information from a knowledge-base, external memory, or past states. The attention mechanism is widely adopted by many state-of-the-art neural networks for computer vision, natural language processing, and machine translation, and accounts for a large portion of total execution time. We observe today's practice of implementing this mechanism using matrix-vector multiplication is suboptimal as the attention mechanism is semantically a content-based search where a large portion of computations ends up not being used. Based on this observation, we design and architect A3, which accelerates attention mechanisms in neural networks with algorithmic approximation and hardware specialization. Our proposed accelerator achieves multiple orders of magnitude improvement in energy efficiency (performance/watt) as well as substantial speedup over the state-of-the-art conventional hardware.