CL | Jun 4, 2023 | Code
OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models
Changhun Lee, Jungyu Jin, Taesu Kim et al.
Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq
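The mixed-precision idea in this abstract can be illustrated with a minimal sketch. It assumes a per-column sensitivity score is already available (the `sensitivity` list below is hypothetical, not the paper's metric) and uses plain min-max round-to-nearest quantization for the dense part, which is a stand-in rather than the tuned quantizer the authors describe.

```python
def quantize_uniform(values, bits):
    """Round-to-nearest uniform quantization over the min-max range of `values`."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:  # constant input: nothing to quantize
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def mixed_precision_quantize(weight, sensitivity, num_keep, bits):
    """Keep the `num_keep` most sensitive columns as-is; quantize the rest per row."""
    n_cols = len(weight[0])
    ranked = sorted(range(n_cols), key=lambda j: sensitivity[j], reverse=True)
    keep = set(ranked[:num_keep])
    dense = [j for j in range(n_cols) if j not in keep]
    out = []
    for row in weight:
        new_row = list(row)  # kept columns stay in full precision
        if dense:
            quantized = quantize_uniform([row[j] for j in dense], bits)
            for j, q in zip(dense, quantized):
                new_row[j] = q
        out.append(new_row)
    return out, sorted(keep)


def squared_error(a, b):
    """Sum of squared element-wise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy weights: column 2 holds outliers that stretch the quantization range.
weight = [[0.1, -0.2, 8.0, 0.3], [0.0, 0.25, -6.0, -0.1]]
sensitivity = [0.1, 0.2, 5.0, 0.1]  # hypothetical scores, assumed given

mixed, kept = mixed_precision_quantize(weight, sensitivity, num_keep=1, bits=2)
baseline, _ = mixed_precision_quantize(weight, sensitivity, num_keep=0, bits=2)
```

In this toy case, keeping the outlier column out of the quantized set narrows the min-max range of each row, so the remaining weights land closer to their original values than in the all-quantized baseline.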
LG | Jul 3, 2023
Squeezing Large-Scale Diffusion Models for Mobile
Jiwoong Choi, Minkyu Kim, Daehyun Ahn et al.
The emergence of diffusion models has greatly broadened the scope of high-fidelity image synthesis, resulting in notable advancements in both practical implementation and academic research. With the active adoption of the model in various real-world applications, the need for on-device deployment has grown considerably. However, deploying large diffusion models such as Stable Diffusion with more than one billion parameters to mobile devices poses distinctive challenges due to the limited computational and memory resources, which may vary according to the device. In this paper, we present the challenges and solutions for deploying Stable Diffusion on mobile devices with the TensorFlow Lite framework, which supports both iOS and Android devices. The resulting Mobile Stable Diffusion achieves an inference latency of under 7 seconds for 512x512 image generation on Android devices with mobile GPUs.
CV | Jun 4, 2023
Temporal Dynamic Quantization for Diffusion Models
Junhyuk So, Jungwon Lee, Daehyun Ahn et al.
The diffusion model has gained popularity in vision applications due to its remarkable generative performance and versatility. However, high storage and computation demands, resulting from the model size and iterative generation, hinder its use on mobile devices. Existing quantization techniques struggle to maintain performance even in 8-bit precision due to the diffusion model's unique property of temporal variation in activation. We introduce a novel quantization method that dynamically adjusts the quantization interval based on time step information, significantly improving output quality. Unlike conventional dynamic quantization techniques, our approach has no computational overhead during inference and is compatible with both post-training quantization (PTQ) and quantization-aware training (QAT). Our extensive experiments demonstrate substantial improvements in output quality with the quantized diffusion model across various datasets.
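One way to picture a time-step-dependent quantization interval with no extra inference cost is a table computed ahead of time and looked up by time step. The sketch below is only an illustration under that assumption: it derives each interval from the maximum absolute calibration activation, whereas the abstract does not say how the intervals are obtained. The names `calibrate_intervals` and `fake_quantize` are hypothetical.

```python
def calibrate_intervals(samples_by_step, bits):
    """Build a per-time-step interval table from calibration activations."""
    qmax = (1 << (bits - 1)) - 1
    return {t: max(abs(x) for x in xs) / qmax for t, xs in samples_by_step.items()}


def fake_quantize(x, interval, bits):
    """Symmetric signed quantization of a scalar, returned in real units."""
    qmax = (1 << (bits - 1)) - 1
    q = max(-qmax - 1, min(qmax, round(x / interval)))
    return q * interval


# Assumed toy calibration data: activations are small at step 0, large at step 999.
samples_by_step = {0: [0.05, -0.1, 0.08], 999: [4.0, -3.5, 2.0]}
intervals = calibrate_intervals(samples_by_step, bits=4)

# A single static interval must cover the largest range across all steps.
static_interval = max(intervals.values())

x = 0.05  # an activation observed at time step 0
dynamic_error = abs(x - fake_quantize(x, intervals[0], bits=4))
static_error = abs(x - fake_quantize(x, static_interval, bits=4))
```

At inference the only extra work in this sketch is the dictionary lookup `intervals[t]`. With the static interval, the small step-0 activation rounds to zero, while the step-specific interval preserves it.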
CV | Apr 15, 2022
INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold
Changhun Lee, Hyungjun Kim, Eunhyeok Park et al.
Binary Neural Networks (BNNs) have emerged as a promising solution for reducing the memory footprint and compute costs of deep neural networks, but they suffer from quality degradation due to the lack of freedom as activations and weights are constrained to the binary values. To compensate for the accuracy drop, we propose a novel BNN design called Binary Neural Network with INSTAnce-aware threshold (INSTA-BNN), which controls the quantization threshold dynamically in an input-dependent or instance-aware manner. According to our observation, higher-order statistics can be a representative metric to estimate the characteristics of the input distribution. INSTA-BNN is designed to adjust the threshold dynamically considering various information, including higher-order statistics, but it is also optimized judiciously to realize minimal overhead on a real device. Our extensive study shows that INSTA-BNN outperforms the baseline by 3.0% and 2.8% on the ImageNet classification task with comparable computing cost, achieving 68.5% and 72.2% top-1 accuracy on ResNet-18 and MobileNetV1 based models, respectively.
LG | May 23
High-fidelity Modeling of Full-scale Pressurized Water Reactor Flow Fields for Machine Learning Applications
Logan A. Burnett, Hyungjun Kim, Hsien-Cheng Chou et al.
This work presents a high-fidelity computational fluid dynamics (CFD) and data-driven modeling framework for assembly-level flow characterization in a four-loop pressurized water reactor (PWR). A full lower-plenum and core-inlet domain was constructed using publicly available geometry and operating conditions, enabling transient simulations with pump-induced swirl boundary conditions. The results show that cold-leg swirl and lower-plenum transport generate strongly heterogeneous assembly-wise inlet flow distributions, particularly near the lower core region, while axial resistance and mixing progressively homogenize the flow at higher elevations. These physics-informed datasets were subsequently used to evaluate machine learning (ML) applications for partial field reconstruction and short-term autoregressive prediction. A 3D convolutional-based inpainting model successfully reconstructed missing assembly-level mass flow rates from partial observations, with errors concentrated in the highly turbulent base (bottom) layer and diminishing significantly in upper layers. Comparative analysis across multiple ML models demonstrates that spatially aware architectures, particularly ConvLSTM, significantly outperform sequence-based (LSTM) and operator-learning (DeepONet) approaches by effectively capturing coupled spatio-temporal dynamics. The study also highlights key challenges, including the sensitivity of inlet flow predictions to turbulence and mesh resolution, as well as the absence of full-scale experimental validation data. Despite these limitations, the results remain consistent with expected physical behavior. Overall, this work establishes high-fidelity CFD as a critical foundation for developing data-driven surrogates, sparse sensing strategies, and future multiphysics coupling frameworks.
CL | Feb 14, 2024 | Code
SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks
Jiwon Song, Kyungseok Oh, Taesu Kim et al.
Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning, because LLMs exhibit block-level redundancy with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs. The code is available at: https://github.com/jiwonsong-dev/SLEB.
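The observation about similar outputs in neighboring blocks suggests a simple way to picture block-level pruning. The sketch below is a toy under stated assumptions, not the SLEB procedure: blocks are plain Python functions on a small vector, redundancy is scored by the cosine similarity between a block's input and output, and the single most similar block is dropped. The helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_redundant_block(blocks, x):
    """Return (index, similarity) of the block that changes its input the least."""
    best_idx, best_sim = -1, -2.0
    hidden = x
    for i, block in enumerate(blocks):
        out = block(hidden)
        sim = cosine(hidden, out)
        if sim > best_sim:
            best_idx, best_sim = i, sim
        hidden = out  # feed the output to the next block
    return best_idx, best_sim


# Toy "blocks" standing in for transformer blocks (assumed for illustration).
blocks = [
    lambda h: [h[0] + h[1], h[1] - h[0]],  # changes direction noticeably
    lambda h: [v + 0.001 for v in h],      # nearly an identity mapping
    lambda h: [h[1], h[0]],                # swaps the two components
]

idx, sim = most_redundant_block(blocks, [1.0, 2.0])
pruned = blocks[:idx] + blocks[idx + 1:]  # drop the whole block
```

Removing a whole block shortens the sequential path through the network, which is why block-level removal translates into latency savings more directly than removing scattered individual weights.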
LG | May 26, 2025 | Code
GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim et al.
Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git
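The block-partitioned structure described above can be sketched as assembling a weight update from a k x k grid of independent low-rank products. This is a minimal illustration with assumed shapes and toy integer values. The `adapters[i][j] = (B, A)` layout and the function names are hypothetical, not the repository's API.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def granular_delta(adapters, k):
    """Assemble the full update from a k x k grid of (B, A) low-rank pairs."""
    rows = []
    for i in range(k):
        # Each product B @ A has shape (out/k) x (in/k).
        blocks = [matmul(*adapters[i][j]) for j in range(k)]
        for r in range(len(blocks[0])):
            rows.append([v for blk in blocks for v in blk[r]])
    return rows


def param_count(adapters):
    """Total number of scalars stored across all (B, A) pairs."""
    return sum(
        len(B) * len(B[0]) + len(A) * len(A[0])
        for grid_row in adapters
        for (B, A) in grid_row
    )


# Toy setup: 4 x 4 weight, k = 2, rank 1 per block (all values assumed).
k = 2
adapters = [
    [([[i + 1], [j + 1]], [[1, 2]]) for j in range(k)]  # B is 2x1, A is 1x2
    for i in range(k)
]

delta = granular_delta(adapters, k)
n_params = param_count(adapters)
```

In this sketch the parameter count is k * r * (in + out), so choosing a per-block rank of R / k would give the same count as a single rank-R LoRA pair. That is arithmetic on the sketch, not a statement about the configuration used in the paper.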
NE | Nov 6, 2018 | Code
Neural Network-Hardware Co-design for Scalable RRAM-based BNN Accelerators
Yulhwa Kim, Hyungjun Kim, Jae-Joon Kim
Recently, RRAM-based Binary Neural Network (BNN) hardware has been gaining interest as it requires only a 1-bit sense-amp and eliminates the need for high-resolution ADC and DAC. However, RRAM-based BNN hardware still requires a high-resolution ADC for partial sum calculation to implement large-scale neural networks using multiple memory arrays. We propose a neural network-hardware co-design approach to split the input to fit each split network on an RRAM array so that the reconstructed BNNs calculate a 1-bit output neuron in each array. As a result, the ADC can be completely eliminated from the design even for large-scale neural networks. Simulation results show that the proposed network reconstruction and retraining recover the inference accuracy of the original BNN. The accuracy loss of the proposed scheme in the CIFAR-10 test case was less than 1.1% compared to the original network. The code for training and running the proposed BNN models is available at: https://github.com/YulhwaKim/RRAMScalable_BNN.
MTRL-SCI | May 4
From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Aritra Roy, Kevin Shen, Andrew MacBride et al.
Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community-developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single-purpose LLM tools toward integrated, multi-agent workflows that combine retrieval, reasoning, tool use, and domain-specific validation. Prominent themes include retrieval-augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory-integrated closed-loop systems. Together, these results suggest that LLMs are evolving from general-purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM-enabled workflows in materials science and chemistry.
LG | Feb 15, 2024
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Taesu Kim, Jongho Lee, Daehyun Ahn et al.
We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate up to 1.91x speedup over existing kernels of AutoAWQ on larger batches and up to 1.94x throughput gain on representative LLM models on various NVIDIA GPU devices.
DC | Dec 31, 2024
Debunking the CUDA Myth Towards GPU-based AI Systems
Yunjae Lee, Juntaek Lim, Jehyeon Bang et al.
This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which are currently the de facto standard in AI system design. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU's programmability by discussing several software-level optimization strategies to employ for implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU's dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.
CV | Jul 15, 2025
Modernizing CNN-based Weather Forecast Model towards Higher Computational Efficiency
Minjong Cheon, Eunhan Goo, Su-Hyeon Shin et al.
Recently, AI-based weather forecast models have achieved impressive advances. These models have reached accuracy levels comparable to traditional NWP systems, marking a significant milestone in data-driven weather prediction. However, they mostly leverage Transformer-based architectures, which often leads to high training complexity and resource demands due to the massive parameter sizes. In this study, we introduce a modernized CNN-based model for global weather forecasting that delivers competitive accuracy while significantly reducing computational requirements. To present a systematic modernization roadmap, we highlight key architectural enhancements across multiple design scales from an earlier CNN-based approach. KAI-a incorporates a scale-invariant architecture and InceptionNeXt-based blocks within a geophysically-aware design, tailored to the structure of Earth system data. Trained on the ERA5 daily dataset with 67 atmospheric variables, the model contains about 7 million parameters and completes training in just 12 hours on a single NVIDIA L40s GPU. Our evaluation shows that KAI-a matches the performance of state-of-the-art models in medium-range weather forecasting, while offering a significantly lightweight design. Furthermore, case studies on the 2018 European heatwave and the East Asian summer monsoon demonstrate KAI-a's robust skill in capturing extreme events, reinforcing its practical utility.
LG | Dec 2, 2020
Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution
Hyungjun Kim, Jihoon Park, Changhun Lee et al.
Binarization of neural network models is considered as one of the promising methods to deploy deep neural network models on resource-constrained environments such as mobile devices. However, Binary Neural Networks (BNNs) tend to suffer from severe accuracy degradation compared to the full-precision counterpart model. Several techniques were proposed to improve the accuracy of BNNs. One of the approaches is to balance the distribution of binary activations so that the amount of information in the binary activations becomes maximum. Based on extensive analysis, in stark contrast to previous work, we argue that unbalanced activation distribution can actually improve the accuracy of BNNs. We also show that adjusting the threshold values of binary activation functions results in the unbalanced distribution of the binary activation, which increases the accuracy of BNN models. Experimental results show that the accuracy of previous BNN models (e.g. XNOR-Net and Bi-Real-Net) can be improved by simply shifting the threshold values of binary activation functions without requiring any other modification.
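The link between the threshold and the balance of binary activations can be seen in a small sketch. The pre-activation values and the shift of -0.3 below are arbitrary assumptions chosen for illustration. The abstract does not state which direction or magnitude of shift works best.

```python
def binarize(xs, threshold=0.0):
    """Binary activation: +1 at or above the threshold, -1 below it."""
    return [1 if x >= threshold else -1 for x in xs]


def positive_fraction(bits):
    """Share of activations that are +1."""
    return sum(1 for b in bits if b == 1) / len(bits)


# Symmetric toy pre-activations (assumed data).
pre_activations = [-1.5, -0.5, -0.2, 0.2, 0.5, 1.5]

balanced = positive_fraction(binarize(pre_activations, threshold=0.0))
shifted = positive_fraction(binarize(pre_activations, threshold=-0.3))
```

With the threshold at zero, half of the activations are +1. Moving the threshold to -0.3 flips one more value to +1, giving an unbalanced split without touching the weights or the rest of the network.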
CL | Sep 22, 2020
SUMBT+LaRL: Effective Multi-domain End-to-end Neural Task-oriented Dialog System
Hwaran Lee, Seokhwan Jo, HyungJun Kim et al.
Neural approaches to developing each dialog component in task-oriented dialog systems have improved remarkably, yet optimizing the overall system performance remains a challenge. Besides, previous research on modeling complicated multi-domain goal-oriented dialogs in end-to-end fashion has been limited. In this paper, we present an effective multi-domain end-to-end trainable neural dialog system SUMBT+LaRL that incorporates two previous strong models and facilitates them to be fully differentiable. Specifically, the SUMBT+ estimates user-acts as well as dialog belief states, and the LaRL models latent system action spaces and generates responses given the estimated contexts. We emphasize that the training framework of three steps significantly and stably increases dialog success rates: separately pretraining the SUMBT+ and LaRL, fine-tuning the entire system, and then reinforcement learning of dialog policy. We also introduce new reward criteria of reinforcement learning for dialog policy training. Then, we discuss experimental results depending on the reward criteria and different dialog evaluation methods. Consequently, our model achieved the new state-of-the-art success rate of 85.4% on corpus-based evaluation, and a comparable success rate of 81.40% on simulator-based evaluation provided by the DSTC8 challenge. To the best of our knowledge, our work is the first comprehensive study of a modularized E2E multi-domain dialog system that learns from each component up to the entire dialog policy for task success.
LG | Sep 8, 2020
Empirical Strategy for Stretching Probability Distribution in Neural-network-based Regression
Eunho Koo, Hyungjun Kim
In regression analysis under artificial neural networks, the prediction performance depends on determining the appropriate weights between layers. As randomly initialized weights are updated during back-propagation using the gradient descent procedure under a given loss function, the loss function structure can affect the performance significantly. In this study, we considered the distribution error, i.e., the inconsistency of two distributions (those of the predicted values and label), as the prediction error, and proposed weighted empirical stretching (WES) as a novel loss function to increase the overlap area of the two distributions. The function depends on the distribution of a given label, thus, it is applicable to any distribution shape. Moreover, it contains a scaling hyperparameter such that the appropriate parameter value maximizes the common section of the two distributions. To test the function capability, we generated ideal distributed curves (unimodal, skewed unimodal, bimodal, and skewed bimodal) as the labels, and used the Fourier-extracted input data from the curves under a feedforward neural network. In general, WES outperformed loss functions in wide use, and the performance was robust to the various noise levels. The improved results in RMSE for the extreme domain (i.e., both tail regions of the distribution) are expected to be utilized for prediction of abnormal events in non-linear complex systems such as natural disaster and financial crisis.
LG | Feb 16, 2020
BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations
Hyungjun Kim, Kyungsu Kim, Jinseok Kim et al.
Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.
ET | Jul 24, 2019
Zero-shifting Technique for Deep Neural Network Training on Resistive Cross-point Arrays
Hyungjun Kim, Malte Rasch, Tayfun Gokmen et al.
A resistive memory device-based computing architecture is one of the promising platforms for energy-efficient Deep Neural Network (DNN) training accelerators. The key technical challenge in realizing such accelerators is to accumulate the gradient information without a bias. Unlike the digital numbers in software which can be assigned and accessed with desired accuracy, numbers stored in resistive memory devices can only be manipulated following the physics of the device, which can significantly limit the training performance. Therefore, additional techniques and algorithm-level remedies are required to achieve the best possible performance in resistive memory device-based accelerators. In this paper, we analyze asymmetric conductance modulation characteristics in RRAM using the soft-bound synapse model and present an in-depth analysis of the relationship between device characteristics and DNN model accuracy using a 3-layer DNN trained on the MNIST dataset. We show that the imbalance between up and down updates leads to poor network performance. We introduce the concept of a symmetry point and propose a zero-shifting technique which can compensate for the imbalance by programming the reference device and changing the zero value point of the weight. By using this zero-shifting method, we show that network performance dramatically improves for imbalanced synapse devices.
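The symmetry point and the zero shift can be illustrated with a toy device model. The sketch assumes a linear soft-bound update in which each pulse moves the weight by a fixed fraction of the remaining distance to its bound. The update form, the constants, and the function names are assumptions for illustration, not the paper's device parameters.

```python
def pulse_up(w, a_up, w_max):
    """Potentiation step: shrinks as the weight approaches w_max."""
    return w + a_up * (w_max - w)


def pulse_down(w, a_down, w_min):
    """Depression step: shrinks as the weight approaches w_min."""
    return w - a_down * (w - w_min)


def symmetry_point(a_up, a_down, w_min, w_max):
    """Weight where up and down steps have equal magnitude.

    Solves a_up * (w_max - w) = a_down * (w - w_min) for w.
    """
    return (a_up * w_max + a_down * w_min) / (a_up + a_down)


# Assumed imbalanced device: up pulses are twice as strong as down pulses.
a_up, a_down, w_min, w_max = 0.1, 0.05, -1.0, 1.0
sp = symmetry_point(a_up, a_down, w_min, w_max)

# Alternating up/down pulses drive the device toward the symmetry point.
w = -0.8
for _ in range(200):
    w = pulse_down(pulse_up(w, a_up, w_max), a_down, w_min)

# Zero-shifting: read the weight relative to a reference set at the symmetry point.
effective_weight = w - sp
```

In this model, random up and down updates with no net gradient pull the stored value toward the symmetry point rather than leaving it in place. Measuring the weight relative to a reference at that point makes the drift target coincide with zero.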
NE | Mar 23, 2019
BitSplit-Net: Multi-bit Deep Neural Network with Bitwise Activation Function
Hyungjun Kim, Yulhwa Kim, Sungju Ryu et al.
Significant computational cost and memory requirements for deep neural networks (DNNs) make it difficult to utilize DNNs in resource-constrained environments. Binary neural network (BNN), which uses binary weights and binary activations, has been gaining interest for its hardware-friendly characteristics and minimal resource requirement. However, BNN usually suffers from accuracy degradation. In this paper, we introduce "BitSplit-Net", a neural network which maintains the hardware-friendly characteristics of BNN while improving accuracy by using multi-bit precision. In BitSplit-Net, each bit of multi-bit activations propagates independently throughout the network before being merged at the end of the network. Thus, each bit path of the BitSplit-Net resembles BNN, and hardware-friendly features of BNN, such as the bitwise binary activation function, are preserved in our scheme. We demonstrate that the BitSplit version of LeNet-5, VGG-9, AlexNet, and ResNet-18 can be trained to have similar classification accuracy at a lower computational cost compared to conventional multi-bit networks with low bit precision (<= 4-bit). We further evaluate BitSplit-Net on GPU with a custom CUDA kernel, showing that BitSplit-Net can achieve better hardware performance in comparison to conventional multi-bit networks.
ET | Mar 30, 2017
Deep Neural Network Optimized to Resistive Memory with Nonlinear Current-Voltage Characteristics
Hyungjun Kim, Taesu Kim, Jinseok Kim et al.
Artificial Neural Network computation relies on intensive vector-matrix multiplications. Recently, the emerging nonvolatile memory (NVM) crossbar array has shown the feasibility of implementing such operations with high energy efficiency, and thus there are many works on efficiently utilizing emerging NVM crossbar arrays as analog vector-matrix multipliers. However, its nonlinear I-V characteristics restrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this paper, instead of optimizing hardware parameters to a given neural network, we propose a methodology of reconstructing a neural network itself optimized to resistive memory crossbar arrays. To verify the validity of the proposed method, we simulated various neural networks with the MNIST and CIFAR-10 datasets using two different specific Resistive Random Access Memory (RRAM) models. Simulation results show that our proposed neural network produces significantly higher inference accuracies than conventional neural networks when the synapse devices have nonlinear I-V characteristics.