LGJul 4, 2022
CPrune: Compiler-Informed Model Pruning for Efficient Target-Aware DNN ExecutionTaeho Kim, Yongin Kwon, Jemin Lee et al.
Mobile devices run deep learning models for various purposes, such as image classification and speech recognition. Due to the resource constraints of mobile devices, researchers have focused on either making a lightweight deep neural network (DNN) model using model pruning or generating an efficient code using compiler optimization. Surprisingly, we found that the straightforward integration between model compression and compiler auto-tuning often does not produce the most efficient model for a target device. We propose CPrune, a compiler-informed model pruning for efficient target-aware DNN execution to support an application with a required target accuracy. CPrune makes a lightweight DNN model through informed pruning based on the structural information of subgraphs built during the compiler tuning process. Our experimental results show that CPrune increases the DNN execution speed up to 2.73x compared to the state-of-the-art TVM auto-tune while satisfying the accuracy requirement.
CVMar 22, 2023
Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT SystemsJemin Lee, Yongin Kwon, Sihyeong Park et al.
Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread implementation. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers with optimized attention computation of linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we discover that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the four following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters ($<$5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). We achieve a significant improvement of 17.73% for 8-bit and 29.75% for 6-bit on average, respectively, compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT)}. We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
LGApr 28Code
QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer AttentionSehyeon Oh, Yongin Kwon, Jemin Lee
FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose \textit{QFlash}, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73$\times$ speedup over I-ViT and up to 8.69$\times$ speedup on Swin, while reducing energy consumption by 18.8\% compared to FP16 FlashAttention, without sacrificing Top-1 accuracy on ViT/DeiT and remaining competitive on Swin under per-tensor quantization. Our code is publicly available at https://github.com/EfficientCompLab/qflash.
CLSep 17, 2024
Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to GiantJemin Lee, Sihyeong Park, Jinse Kwon et al.
Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, \textit{hard} tasks do not always experience the largest accuracy losses, indicating that quantization magnifies a model's inherent weaknesses rather than simply correlating with task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant performance declines in Coding and STEM tasks, though it occasionally reports improvements in reasoning.
CVJul 26, 2024
Mixed Non-linear Quantization for Vision TransformersGihwan Kim, Jemin Lee, Sihyeong Park et al.
The majority of quantization methods have been proposed to reduce the model size of Vision Transformers, yet most of them have overlooked the quantization of non-linear operations. Only a few works have addressed quantization for non-linear operations, but they applied a single quantization method across all non-linear operations. We believe that this can be further improved by employing a different quantization method for each non-linear operation. Therefore, to assign the most error-minimizing quantization method from the known methods to each non-linear layer, we propose a mixed non-linear quantization that considers layer-wise quantization sensitivity measured by SQNR difference metric. The results show that our method outperforms I-BERT, FQ-ViT, and I-ViT in both 8-bit and 6-bit settings for ViT, DeiT, and Swin models by an average of 0.6%p and 19.6%p, respectively. Our method outperforms I-BERT and I-ViT by 0.6%p and 20.8%p, respectively, when training time is limited. We plan to release our code at https://gitlab.com/ones-ai/mixed-non-linear-quantization.
CLMay 3, 2025Code
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and EfficiencySihyeong Park, Sungryeol Jeon, Chaelyn Lee et al.
Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
CVNov 19, 2025Code
IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision TransformersGihwan Kim, Jemin Lee, Hyungshin Kim
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44\%p (avg. 1.78\%p) top-1 accuracy improvement for image classification, 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code https://github.com/gihwan-kim/IPTQ-ViT.git.
LGFeb 10, 2022Code
Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast DeploymentJemin Lee, Misun Yu, Yongin Kwon et al.
To adopt convolutional neural networks (CNN) for a range of resource-constrained targets, it is necessary to compress the CNN models by performing quantization, whereby precision representation is converted to a lower bit representation. To overcome problems such as sensitivity of the training dataset, high computational requirements, and large time consumption, post-training quantization methods that do not require retraining have been proposed. In addition, to compensate for the accuracy drop without retraining, previous studies on post-training quantization have proposed several complementary methods: calibration, schemes, clipping, granularity, and mixed-precision. To generate a quantized model with minimal error, it is necessary to study all possible combinations of the methods because each of them is complementary and the CNN models have different characteristics. However, an exhaustive or a heuristic search is either too time-consuming or suboptimal. To overcome this challenge, we propose an auto-tuner known as Quantune, which builds a gradient tree boosting model to accelerate the search for the configurations of quantization and reduce the quantization error. We evaluate and compare Quantune with the random, grid, and genetic algorithms. The experimental results show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07 ~ 0.65% across six CNN models, including the fragile ones (MobileNet, SqueezeNet, and ShuffleNet). To support multiple targets and adopt continuously evolving quantization works, Quantune is implemented on a full-fledged compiler for deep learning as an open-sourced project.
ITApr 25
Age of Information under Source-Aware Truncated ARQ in Multi-Source Wireless Status UpdatingTianci Zhang, Aobo Liu, Zhengchuan Chen et al.
This paper studies information timeliness in multi-source wireless Internet of Things (IoT) status updating systems under a truncated Automatic Repeat reQuest (ARQ) protocol. We propose a source-aware truncated ARQ (SATARQ) scheme that allows differentiated maximum transmission times (MTTs) tailored to different sources. This work focuses on a wireless system with preemptive update management. To study the statistical characteristics of the age of information (AoI) process for each source, a multi-dimensional age process (MDAP) is developed and modeled as a Markov chain, tracking both the AoI and the age of the concerned source's update currently in transmission. Via Markov analysis of the MDAP, we obtain analytical expressions for the distributions and averages of the AoI and peak AoI, as well as the average power consumption of IoT device. The timeliness-energy tradeoff is analyzed by examining the impact of the MTT, update generation probability (UGP), and wireless transmission power (TP). Moreover, this work explores the energy efficiency of the wireless status updating process and its relationship with the information timeliness and energy cost. Numerical results validate the theoretical analysis. Finally, it is demonstrated that the proposed SATARQ, combined with the optimization of MTTs, UGPs, and TPs, significantly improves the overall timeliness-energy tradeoff and energy efficiency across all sources.
NIApr 25, 2025
LLM-hRIC: LLM-empowered Hierarchical RAN Intelligent Control for O-RANLingyan Bao, Sinwoong Yun, Jemin Lee et al.
Despite recent advances in applying large language models (LLMs) and machine learning (ML) techniques to open radio access network (O-RAN), critical challenges remain, such as insufficient cooperation between radio access network (RAN) intelligent controllers (RICs), high computational demands hindering real-time decisions, and the lack of domain-specific finetuning. Therefore, this article introduces the LLM-empowered hierarchical RIC (LLM-hRIC) framework to improve the collaboration between RICs in O-RAN. The LLM-empowered non-real-time RIC (non-RT RIC) acts as a guider, offering a strategic guidance to the near-real-time RIC (near-RT RIC) using global network information. The RL-empowered near-RT RIC acts as an implementer, combining this guidance with local real-time data to make near-RT decisions. We evaluate the feasibility and performance of the LLM-hRIC framework in an integrated access and backhaul (IAB) network setting, and finally, discuss the open challenges of the LLM-hRIC framework for O-RAN.
AIJan 13, 2025
QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI ApplicationsJeongseok Kim, Jemin Lee, Yongin Kwon et al.
Mixed-precision quantization methods have been proposed to reduce model size while minimizing accuracy degradation. However, existing studies require retraining and do not consider the computational overhead and intermediate representations (IR) generated during the compilation process, limiting their application at the compiler level. This computational overhead refers to the runtime latency caused by frequent quantization and dequantization operations during inference. Performing these operations at the individual operator level causes significant runtime delays. To address these issues, we propose QuantuneV2, a compiler-based mixed-precision quantization method designed for practical embedded AI applications. QuantuneV2 performs inference only twice, once before quantization and once after quantization, and operates with a computational complexity of O(n) that increases linearly with the number of model parameters. We also made the sensitivity analysis more stable by using local metrics like weights, activation values, the Signal to Quantization Noise Ratio, and the Mean Squared Error. We also cut down on computational overhead by choosing the best IR and using operator fusion. Experimental results show that QuantuneV2 achieved up to a 10.28 percent improvement in accuracy and a 12.52 percent increase in speed compared to existing methods across five models: ResNet18v1, ResNet50v1, SqueezeNetv1, VGGNet, and MobileNetv2. This demonstrates that QuantuneV2 enhances model performance while maintaining computational efficiency, making it suitable for deployment in embedded AI environments.
LGNov 16, 2024
ML$^2$Tuner: Efficient Code Tuning via Multi-Level Machine Learning ModelsJooHyoung Cha, Munyoung Lee, Jinse Kwon et al.
The increasing complexity of deep learning models necessitates specialized hardware and software optimizations, particularly for deep learning accelerators. Existing autotuning methods often suffer from prolonged tuning times due to profiling invalid configurations, which can cause runtime errors. We introduce ML$^2$Tuner, a multi-level machine learning tuning technique that enhances autotuning efficiency by incorporating a validity prediction model to filter out invalid configurations and an advanced performance prediction model utilizing hidden features from the compilation process. Experimental results on an extended VTA accelerator demonstrate that ML$^2$Tuner achieves equivalent performance improvements using only 12.3% of the samples required with a similar approach as TVM and reduces invalid profiling attempts by an average of 60.8%, Highlighting its potential to enhance autotuning performance by filtering out invalid configurations
CROct 28, 2020
EC-SVC: Secure CAN Bus In-Vehicle Communications with Fine-grained Access Control Based on Edge ComputingDonghyun Yu, Ruei-Hau Hsu, Jemin Lee
In-vehicle communications are not designed for message exchange between the vehicles and outside systems originally. Thus, the security design of message protection is insufficient. Moreover, the internal devices do not have enough resources to process the additional security operations. Nonetheless, due to the characteristic of the in-vehicle network in which messages are broadcast, secure message transmission to specific receivers must be ensured. With consideration of the facts aforementioned, this work addresses resource problems by offloading secure operations to high-performance devices, and uses attribute-based access control to ensure the confidentiality of messages from attackers and unauthorized users. In addition, we reconfigure existing access control based cryptography to address new vulnerabilities arising from the use of edge computing and attribute-based access control. Thus, this paper proposes an edge computing-based security protocol with fine-grained attribute-based encryption using a hash function, symmetric-based cryptography, and reconfigured cryptographic scheme. In addition, this work formally proves the reconfigured cryptographic scheme and security protocol, and evaluates the feasibility of the proposed security protocol in various aspects using the CANoe software.
DCJun 4, 2020
Is Blockchain Suitable for Data Freshness? -- Age-of-Information PerspectiveSungho Lee, Minsu Kim, Jemin Lee et al.
Recent advances in blockchain have led to a significant interest in developing blockchain-based applications. While data can be retained in blockchains, the stored values can be deleted or updated. From a user viewpoint that searches for the data, it is unclear whether the discovered data from the blockchain storage is relevant for real-time decision-making process for blockchain-based application. The data freshness issue serves as a critical factor especially in dynamic networks handling real-time information. In general, transactions to renew the data require additional processing time inside the blockchain network, which is called ledger-commitment latency. Due to this problem, some users may receive outdated data. As a result, it is important to investigate if blockchain is suitable for providing real-time data services. In this article, we first describe blockchain-enabled (BCE) networks with Hyperledger Fabric (HLF). Then, we define age of information (AoI) of BCE networks and investigate the influential factors in this AoI. Analysis and experiments are conducted to support our proposed framework. Lastly, we conclude by discussing some future challenges.
CRJun 9, 2018
ReHand: Secure Region-based Fast Handover with User Anonymity for Small Cell Networks in 5GChun-I Fan, Jheng-Jia Huang, Min-Zhe Zhong et al.
Due to the expectedly higher density of mobile devices and exhaust of radio resources, the fifth generation (5G) mobile networks introduce small cell concept in the radio access technologies, so-called Small Cell Networks (SCNs), to improve radio spectrum utilization. However, this increases the chance of handover due to smaller coverage of a micro base station, i.e., home eNodeB (HeNB) in 5G. Subsequently, the latency will increase as the costs of authenticated key exchange protocol, which ensures entity authentication and communication confidentiality for secure handover, also increase totally. Thus, this work presents a secure region-based handover scheme (ReHand) with user anonymity and fast revocation for SCNs in 5G. ReHand greatly reduces the communication costs when UEs roam between small cells within the region of a macro base station, i.e., eNB in 5G, and the computation costs due to the employment of symmetry-based cryptographic operations. Compared to the three elaborated related works, ReHand dramatically reduces the costs from 82.92% to 99.99%. Nevertheless, this work demonstrates the security of ReHand by theoretically formal proofs.
CRSep 19, 2017
Reconfigurable Security: Edge Computing-based Framework for IoTRuei-Hau Hsu, Jemin Lee, Tony Q. S. Quek et al.
In various scenarios, achieving security between IoT devices is challenging since the devices may have different dedicated communication standards, resource constraints as well as various applications. In this article, we first provide requirements and existing solutions for IoT security. We then introduce a new reconfigurable security framework based on edge computing, which utilizes a near-user edge device, i.e., security agent, to simplify key management and offload the computational costs of security algorithms at IoT devices. This framework is designed to overcome the challenges including high computation costs, low flexibility in key management, and low compatibility in deploying new security algorithms in IoT, especially when adopting advanced cryptographic primitives. We also provide the design principles of the reconfigurable security framework, the exemplary security protocols for anonymous authentication and secure data access control, and the performance analysis in terms of feasibility and usability. The reconfigurable security framework paves a new way to strength IoT security by edge computing.
CRMar 13, 2017
GRAAD: Group Anonymous and Accountable D2D Communication in Mobile NetworksRuei-Hau Hsu, Jemin Lee, Tony Q. S. Quek et al.
Device-to-Device (D2D) communication is mainly launched by the transmission requirements between devices for specific applications such as Proximity Services in Long-Term Evolution Advanced (LTE-A) networks, and each application will form a group of devices for the network-covered and network-absent D2D communications. Although there are many privacy concerns in D2D communication, they have not been well-addressed in current communication standards. This work introduces network-covered and network-absent authenticated key exchange protocols for D2D communications to guarantee accountable group anonymity, end-to-end security to network operators, as well as traceability and revocability for accounting and management requirements. We formally prove the security of those protocols, and also develop an analytic model to evaluate the quality of authentication protocols by authentication success rate in D2D communications. Besides, we implement the proposed protocols on android mobile devices to evaluate the computation costs of the protocols. We also evaluate the authentication success rate by the proposed analytic model and prove the correctness of the analytic model via simulation. Those evaluations show that the proposed protocols are feasible to the performance requirements of D2D communications.