CVMar 4, 2023
Training-Free Acceleration of ViTs with Delayed Spatial MergingJung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi et al.
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
CVJun 30, 2022
Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network AcceleratorsJung Hwan Heo, Arash Fayyazi, Amirhossein Esmaili et al.
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49x energy efficiency gain while lowering storage requirements by 3.67x for total weight storage (non-pruned weights plus indexing) and 22,044x for indexing memory.
86.0LGApr 21
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM ServingJinda Jia, Jisen Li, Zhongzhu Zhou et al.
KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design--token-wise INT4 quantization with block-diagonal Hadamard rotation--consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
LGSep 27, 2023Code
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language ModelsJung Hwan Heo, Jeonghoon Kim, Beomseok Kwon et al.
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output-channel (per-OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
CVMay 8, 2023
CrAFT: Compression-Aware Fine-Tuning for Efficient Visual Task AdaptationJung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi et al.
Transfer learning has become a popular task adaptation method in the era of foundation models. However, many foundation models require large storage and computing resources, which makes off-the-shelf deployment impractical. Post-training compression techniques such as pruning and quantization can help lower deployment costs. Unfortunately, the resulting performance degradation limits the usability and benefits of such techniques. To close this performance gap, we propose CrAFT, a simple fine-tuning framework that enables effective post-training network compression. In CrAFT, users simply employ the default fine-tuning schedule along with sharpness minimization objective, simultaneously facilitating task adaptation and compression-friendliness. Contrary to the conventional sharpness minimization techniques, which are applied during pretraining, the CrAFT approach adds negligible training overhead as fine-tuning is done in under a couple of minutes or hours with a single GPU. The effectiveness of CrAFT, which is a general-purpose tool that can significantly boost one-shot pruning and post-training quantization, is demonstrated on both convolution-based and attention-based vision foundation models on a variety of target tasks. The code will be made publicly available.