I-ViT: Integer-only Quantization for Efficient Vision Transformer InferenceZhikai Li, Qingyi Gu
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline can allow the quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic is based on the homogeneity condition in convolutional neural networks, which is not applicable to the non-linear components in ViTs, making integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, and without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models and the results show that integer-only INT8 quantization achieves comparable (or even slightly higher) accuracy to the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving 3.72$\sim$4.11$\times$ inference speedup compared to the FP model. Code of both Pytorch and TVM is released at https://github.com/zkkli/I-ViT.
Patch Similarity Aware Data-Free Quantization for Vision TransformersZhikai Li, Liping Ma, Mengjuan Chen et al.
Vision transformers have recently gained great success on various computer vision tasks; nevertheless, their high model complexity makes it challenging to deploy on resource-constrained devices. Quantization is an effective approach to reduce model complexity, and data-free quantization, which can address data privacy and security concerns during model deployment, has received widespread interest. Unfortunately, all existing methods, such as BN regularization, were designed for convolutional neural networks and cannot be applied to vision transformers with significantly different model architectures. In this paper, we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers, to enable the generation of "realistic" samples based on the vision transformer's unique properties for calibrating the quantization parameters. Specifically, we analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images. The above insights guide us to design a relative value metric to optimize the Gaussian noise to approximate the real images, which are then utilized to calibrate the quantization parameters. Extensive experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT, which can even outperform the real-data-driven methods. Code is available at: https://github.com/zkkli/PSAQ-ViT.
PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision TransformersZhikai Li, Mengjuan Chen, Junrui Xiao et al.
Data-free quantization can potentially address data privacy and security concerns in model compression, and thus has been widely investigated. Recently, PSAQ-ViT designs a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs), achieving the first attempt at data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs, built on top of PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT, we introduce an adaptive teacher-student strategy, which facilitates the constant cyclic evolution of the generated samples and the quantized model (student) in a competitive and interactive fashion under the supervision of the full-precision model (teacher), thus significantly improving the accuracy of the quantized model. Moreover, without the auxiliary category guidance, we employ the task- and model-independent prior information, making the general-purpose scheme compatible with a broad range of vision tasks and models. Extensive experiments are conducted on various models on image classification, object detection, and semantic segmentation tasks, and PSAQ-ViT V2, with the naive quantization strategy and without access to real-world data, consistently achieves competitive results, showing potential as a powerful baseline on data-free quantization for ViTs. For instance, with Swin-S as the (backbone) model, 8-bit quantization reaches 82.13 top-1 accuracy on ImageNet, 50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that accurate and general PSAQ-ViT V2 can serve as a potential and practice solution in real-world applications involving sensitive data. Code is released and merged at: https://github.com/zkkli/PSAQ-ViT.
4.6CLOct 11, 2023
QFT: Quantized Full-parameter Tuning of LLMs with Affordable ResourcesZhikai Li, Xiaoxuan Liu, Banghua Zhu et al. · berkeley
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; ii) and for quantized weights, we employ the hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires only <30GB of memory, making it feasible on a single A6000 GPU.
DilateQuant: Accurate and Efficient Diffusion Quantization via Weight DilationXuewen Liu, Zhikai Li, Minhao Jiang et al.
Model quantization is a promising method for accelerating and compressing diffusion models. Nevertheless, since post-training quantization (PTQ) fails catastrophically at low-bit cases, quantization-aware training (QAT) is essential. Unfortunately, the wide range and time-varying activations in diffusion models sharply increase the complexity of quantization, making existing QAT methods inefficient. Equivalent scaling can effectively reduce activation range, but previous methods remain the overall quantization error unchanged. More critically, these methods significantly disrupt the original weight distribution, resulting in poor weight initialization and challenging convergence during QAT training. In this paper, we propose a novel QAT framework for diffusion models, called DilateQuant. Specifically, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through equivalent scaling. WD decreases the activation range while preserving the original weight range, which steadily reduces the quantization error and ensures model convergence. To further enhance accuracy and efficiency, we design a Temporal Parallel Quantizer (TPQ) to address the time-varying activations and introduce a Block-wise Knowledge Distillation (BKD) to reduce resource consumption in training. Extensive experiments demonstrate that DilateQuant significantly outperforms existing methods in terms of accuracy and efficiency. Code is available at http://github.com/BienLuky/DilateQuant .
LLM Inference Unveiled: Survey and Roofline Model InsightsZhihang Yuan, Yuzhang Shang, Yang Zhou et al.
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM Inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithm improvements (e.g., Early Exit and Mixture-of-Expert), and both hardware and system-level enhancements. Our survey stands out by analyzing these methods with roofline model, helping us understand their impact on memory access and computation. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The analyze tool, LLM-Viewer, is open-sourced.
7.6CVSep 14, 2024
Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in HealthcareZhikai Li, Jing Zhang, Qingyi Gu
The disparity in healthcare personnel expertise and medical resources across different regions of the world is a pressing social issue. Artificial intelligence technology offers new opportunities to alleviate this issue. Segment Anything Model (SAM), which excels in intelligent image segmentation, has demonstrated exceptional performance in medical monitoring and assisted diagnosis. Unfortunately, the huge computational and storage overhead of SAM poses significant challenges for deployment on resource-limited edge devices. Quantization is an effective solution for model compression; however, traditional methods rely heavily on original data for calibration, which raises widespread concerns about medical data privacy and security. In this paper, we propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data, thus effectively preserving data privacy during model compression. Specifically, we propose pseudo-positive label evolution for segmentation, combined with patch similarity, to fully leverage the semantic and distribution priors in pre-trained models, which facilitates high-quality data synthesis as a substitute for real data. Furthermore, we introduce scale reparameterization to ensure the accuracy of low-bit quantization. We perform extensive segmentation experiments on various datasets, and DFQ-SAM consistently provides significant performance on low-bit quantization. DFQ-SAM eliminates the need for data transfer in cloud-edge collaboration, thereby protecting sensitive data from potential attacks. It enables secure, fast, and personalized healthcare services at the edge, which enhances system efficiency and optimizes resource allocation, and thus facilitating the pervasive application of artificial intelligence in worldwide healthcare.
EDA-DM: Enhanced Distribution Alignment for Post-Training Quantization of Diffusion ModelsXuewen Liu, Zhikai Li, Junrui Xiao et al.
Diffusion models have achieved great success in image generation tasks. However, the lengthy denoising process and complex neural networks hinder their low-latency applications in real-world scenarios. Quantization can effectively reduce model complexity, and post-training quantization (PTQ), which does not require fine-tuning, is highly promising for compressing and accelerating diffusion models. Unfortunately, we find that due to the highly dynamic activations, existing PTQ methods suffer from distribution mismatch issues at both calibration sample level and reconstruction output level, which makes the performance far from satisfactory. In this paper, we propose EDA-DM, a standardized PTQ method that efficiently addresses the above issues. Specifically, at the calibration sample level, we extract information from the density and diversity of latent space feature maps, which guides the selection of calibration samples to align with the overall sample distribution; and at the reconstruction output level, we theoretically analyze the reasons for previous reconstruction failures and, based on this insight, optimize block reconstruction using the Hessian loss of layers, aligning the outputs of quantized model and full-precision model at different network granularity. Extensive experiments demonstrate that EDA-DM significantly outperforms the existing PTQ methods across various models and datasets. Our method achieves a 1.83 times speedup and 4 times compression for the popular Stable-Diffusion on MS-COCO, with only a 0.05 loss in CLIP score. Code is available at http://github.com/BienLuky/EDA-DM .
Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video GenerationXuewen Liu, Zhikai Li, Jing Zhang et al.
Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
15.0LGFeb 8, 2024
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale ReparameterizationZhikai Li, Xuewen Liu, Jing Zhang et al.
Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small dataset for calibration and avoids end-to-end retraining, is a promising solution for compressing these large models. Regrettably, existing PTQ methods typically exhibit non-trivial performance loss. We find that the performance bottleneck stems from over-consideration of hardware compatibility in the quantization process, compelling them to reluctantly employ simple quantizers, albeit at the expense of accuracy. With the above insights, we propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm to address the above issues. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and performs mathematically equivalent transformations between the two through quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. More specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. Initially, we apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively, which are tailored to their distributions. In particular, for the former, we introduce a learnable per-channel dual clipping scheme, which is designed to efficiently identify outliers in the unbalanced activations with fine granularity. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference. Moreover, quantized weight reconstruction is seamlessly integrated into the above procedure to further push the performance limits. Extensive experiments are performed on different large-scale transformer variants on multiple tasks, including vision, language, and multi-modal transformers, and RepQuant encouragingly demonstrates significant performance advantages.
4.6LGDec 13, 2024
TTAQ: Towards Stable Post-training Quantization in Continuous Domain AdaptationJunrui Xiao, Zhikai Li, Lianwei Yang et al.
Post-training quantization (PTQ) reduces excessive hardware cost by quantizing full-precision models into lower bit representations on a tiny calibration set, without retraining. Despite the remarkable progress made through recent efforts, traditional PTQ methods typically encounter failure in dynamic and ever-changing real-world scenarios, involving unpredictable data streams and continual domain shifts, which poses greater challenges. In this paper, we propose a novel and stable quantization process for test-time adaptation (TTA), dubbed TTAQ, to address the performance degradation of traditional PTQ in dynamically evolving test domains. To tackle domain shifts in quantizer, TTAQ proposes the Perturbation Error Mitigation (PEM) and Perturbation Consistency Reconstruction (PCR). Specifically, PEM analyzes the error propagation and devises a weight regularization scheme to mitigate the impact of input perturbations. On the other hand, PCR introduces consistency learning to ensure that quantized models provide stable predictions for same sample. Furthermore, we introduce Adaptive Balanced Loss (ABL) to adjust the logits by taking advantage of the frequency and complexity of the class, which can effectively address the class imbalance caused by unpredictable data streams during optimization. Extensive experiments are conducted on multiple datasets with generic TTA methods, proving that TTAQ can outperform existing baselines and encouragingly improve the accuracy of low bit PTQ models in continually changing test domains. For instance, TTAQ decreases the mean error of 2-bit models on ImageNet-C dataset by an impressive 10.1\%.
Unsupervised Learning for Class Distribution MismatchPan Du, Wangbo Zhao, Xinai Lu et al.
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes.
9.1CVMay 24, 2023
BinaryViT: Towards Efficient and Accurate Binary Vision TransformersJunrui Xiao, Zhikai Li, Lianwei Yang et al.
Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields, but the considerable memory and computation costs hinders their application on resource-limited devices. As one of the most powerful compression methods, binarization reduces the computation of the neural network by quantizing the weights and activation values as $\pm$1. Although existing binarization methods have demonstrated excellent performance on Convolutional Neural Networks (CNNs), the full binarization of ViTs is still under-studied and suffering a significant performance drop. In this paper, we first argue empirically that the severe performance degradation is mainly caused by the weight oscillation in the binarization training and the information distortion in the activation of ViTs. Based on these analyses, we propose $\textbf{BinaryViT}$, an accurate full binarization scheme for ViTs, which pushes the quantization of ViTs to the limit. Specifically, we propose a novel gradient regularization scheme (GRS) for driving a bimodal distribution of the weights to reduce oscillation in binarization training. Moreover, we design an activation shift module (ASM) to adaptively tune the activation distribution to reduce the information distortion caused by binarization. Extensive experiments on ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improve the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of 16.2$\times$ and 17.7$\times$ in model size and OPs compared to the full-precision DeiT-S.
13.1CVMay 11, 2023
Patch-wise Mixed-Precision Quantization of Vision TransformerJunrui Xiao, Zhikai Li, Lianwei Yang et al.
As emerging hardware begins to support mixed bit-width arithmetic computation, mixed-precision quantization is widely used to reduce the complexity of neural networks. However, Vision Transformers (ViTs) require complex self-attention computation to guarantee the learning of powerful feature representations, which makes mixed-precision quantization of ViTs still challenging. In this paper, we propose a novel patch-wise mixed-precision quantization (PMQ) for efficient inference of ViTs. Specifically, we design a lightweight global metric, which is faster than existing methods, to measure the sensitivity of each component in ViTs to quantization errors. Moreover, we also introduce a pareto frontier approach to automatically allocate the optimal bit-precision according to the sensitivity. To further reduce the computational complexity of self-attention in inference stage, we propose a patch-wise module to reallocate bit-width of patches in each layer. Extensive experiments on the ImageNet dataset shows that our method greatly reduces the search cost and facilitates the application of mixed-precision quantization to ViTs.