Wei-Fen Lin

24.8 · AR · May 21

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Wei-Fen Lin, Jen-Chien Chang, Yen-Po Chen et al.

Architectural simulation has become the critical bottleneck limiting design space exploration for high-performance computing systems. Modern GPUs and AI accelerators -- with hundreds to thousands of tightly-coupled components -- demand simulation frameworks that deliver efficient parallelism and scalable single-node execution. Existing frameworks fall short: SST focuses on multi-node MPI scalability but struggles with intra-node scaling, while GPGPU-Sim remains largely single-threaded. Critically, none expose a mechanism for users to optimize threading for their specific workloads. We introduce ACALSim, a scalable parallel simulation framework providing infrastructure and APIs for building high-performance simulators -- timing-model accuracy remains the responsibility of simulator developers. Its key innovation is a pluggable thread-management architecture that lets developers implement custom scheduling strategies tailored to specific simulation patterns, a capability absent from existing frameworks. Complementing it are (1) event-driven execution with fast-forward to eliminate idle-cycle overhead, (2) a shared-memory data model enabling zero-copy communication, and (3) a two-phase parallel execution model for deterministic thread scaling. We demonstrate ACALSim through HPCSim, a GPU simulator targeting A100-class architectures. Against an SST implementation using identical shared timing cores to isolate framework overhead, ACALSim achieves over 14x speedup with 41% lower memory footprint; hardware validation confirms 0.72--1.22x cycle-count correlation with A100 measurements. While SST fails to complete 256+ thread-block workloads within practical time limits, ACALSim simulates full LLaMA transformer layers (single block) in 17.7 minutes for LLaMA-7B and 30.4 minutes for LLaMA-13B -- enabling design space exploration that SST cannot achieve.
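The idea behind event-driven execution with fast-forward can be illustrated with a minimal sketch. This is not ACALSim's API: the function `run_event_driven` and the event names are hypothetical, and the sketch only assumes that a simulator keeps pending events in a priority queue and advances its clock directly to the next event's tick instead of stepping through every idle cycle.

```python
import heapq


def run_event_driven(events):
    """Process (tick, name) events in time order, skipping idle cycles.

    Hypothetical helper for illustration only; not part of ACALSim.
    Returns the final clock value, the number of loop iterations,
    and the order in which events were handled.
    """
    queue = list(events)
    heapq.heapify(queue)  # min-heap ordered by tick
    now = 0
    steps = 0
    order = []
    while queue:
        tick, name = heapq.heappop(queue)
        now = tick  # fast-forward: jump straight to the next event's tick
        steps += 1
        order.append(name)
    return now, steps, order


# Hypothetical events separated by long idle stretches.
events = [(1000, "dram_response"), (3, "issue_load"), (1_000_000, "kernel_done")]
final_tick, steps, order = run_event_driven(events)
```

In this sketch the loop runs once per event, so three iterations cover a million simulated ticks; a cycle-by-cycle loop would iterate once per tick regardless of whether anything happens.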

CV · Mar 11, 2024

QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning

Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang et al.

Transformer-based models have gained widespread popularity in both the computer vision (CV) and natural language processing (NLP) fields. However, significant challenges arise during post-training linear quantization, leading to noticeable reductions in inference accuracy. Our study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, QuantTune. Firstly, our analysis revealed that, on average, 65% of quantization errors result from the precision loss incurred by the dynamic range amplification effect of outliers across the target Transformer-based models. Secondly, QuantTune adjusts weights based on the deviation of outlier activations and effectively constrains the dynamic ranges of the problematic activations. As a result, it successfully mitigates the negative impact of outliers on the inference accuracy of quantized models. Lastly, QuantTune can be seamlessly integrated into the back-propagation pass in the fine-tuning process without requiring extra complexity in inference software and hardware design. Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, BERT-base, and OPT. QuantTune reduces accuracy drops by 12.09% at 8-bit quantization and 33.8% at 7-bit compared to top calibration methods, outperforming state-of-the-art solutions by over 18.84% across ViT models.
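The dynamic range amplification effect can be made concrete with a small sketch. It does not reproduce QuantTune itself: it assumes plain symmetric per-tensor linear quantization, and the activation values and helper names (`quantize_dequantize`, `mean_squared_error`) are hypothetical. It shows how one outlier stretches the quantization scale and costs precision on all the ordinary values.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor linear quantization followed by dequantization.

    Hypothetical helper for illustration; the scale is set by the largest
    absolute value, so a single outlier widens the step for every element.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical activations: 101 values in [-0.5, 0.5], plus one outlier at 50.
inliers = [i / 100.0 - 0.5 for i in range(101)]
with_outlier = inliers + [50.0]

# Error on the ordinary values when the range is set by the inliers alone.
err_clean = mean_squared_error(inliers, quantize_dequantize(inliers))

# Error on the same ordinary values when the outlier sets the range.
dequantized = quantize_dequantize(with_outlier)
err_outlier = mean_squared_error(inliers, dequantized[:-1])
```

With the outlier present, the 8-bit step grows from about 0.004 to about 0.39, so most of the ordinary activations collapse onto only three levels. Constraining the dynamic range of such activations, which the abstract describes as QuantTune's aim, avoids this loss.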
