ARJan 8, 2024
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAsShulin Zeng, Jun Liu, Guohao Dai et al. · tsinghua
Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLM's computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution that the computation and memory overhead of LLMs can be solved by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). We propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
LGNov 17, 2024
Towards Accurate and Efficient Sub-8-Bit Integer TrainingWenjin Guo, Donglai Liu, Weiying Xie et al.
Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve high accuracy and efficiency simultaneously. In this paper, we explore sub-8-bit integer training from its essence of gradient descent optimization. Our integer training framework includes two components: ShiftQuant to realize accurate gradient estimation, and L1 normalization to smoothen the loss landscape. ShiftQuant attains performance that approaches the theoretical upper bound of group quantization. Furthermore, it liberates group quantization from inefficient memory rearrangement. The L1 normalization facilitates the implementation of fully quantized normalization layers with impressive convergence accuracy. Our method frees sub-8-bit integer training from pre-processing and supports general devices. This framework achieves negligible accuracy loss across various neural networks and tasks ($0.92\%$ on 4-bit ResNets, $0.61\%$ on 6-bit Transformers). The prototypical implementation of ShiftQuant achieves more than $1.85\times/15.3\%$ performance improvement on CPU/GPU compared to its FP16 counterparts, and $33.9\%$ resource consumption reduction on FPGA than the FP16 counterparts. The proposed fully-quantized L1 normalization layers achieve more than $35.54\%$ improvement in throughout on CPU compared to traditional L2 normalization layers. Moreover, theoretical analysis verifies the advancement of our method.
LGJun 4, 2020
Exploring the Potential of Low-bit Training of Convolutional Neural NetworksKai Zhong, Xuefei Ning, Guohao Dai et al.
In this work, we propose a low-bit training framework for convolutional neural networks, which is built around a novel multi-level scaling (MLS) tensor format. Our framework focuses on reducing the energy consumption of convolution operations by quantizing all the convolution operands to low bit-width format. Specifically, we propose the MLS tensor format, in which the element-wise bit-width can be largely reduced. Then, we describe the dynamic quantization and the low-bit tensor convolution arithmetic to leverage the MLS tensor format efficiently. Experiments show that our framework achieves a superior trade-off between the accuracy and the bit-width than previous low-bit training frameworks. For training a variety of models on CIFAR-10, using 1-bit mantissa and 2-bit exponent is adequate to keep the accuracy loss within $1\%$. And on larger datasets like ImageNet, using 4-bit mantissa and 2-bit exponent is adequate to keep the accuracy loss within $1\%$. Through the energy consumption simulation of the computing units, we can estimate that training a variety of models with our framework could achieve $8.3\sim10.2\times$ and $1.9\sim2.3\times$ higher energy efficiency than training with full-precision and 8-bit floating-point arithmetic, respectively.
DCMar 26, 2020
Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the CloudShulin Zeng, Guohao Dai, Hanbo Sun et al.
FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications. Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way for multiple users sharing a single FPGA, and require re-compilation with $\sim$100 s overhead. Such designs lead to poor isolation and heavy performance loss for multiple users, which are far away from providing efficient and flexible FPGA virtualization for neither public nor private cloud scenarios. To solve these problems, we introduce a novel virtualization framework for instruction architecture set (ISA) based on DNN accelerators by sharing a single FPGA. We enable the isolation by introducing a two-level instruction dispatch module and a multi-core based hardware resources pool. Such designs provide isolated and runtime-programmable hardware resources, further leading to performance isolation for multiple users. On the other hand, to overcome the heavy re-compilation overheads, we propose a tiling-based instruction frame package design and two-stage static-dynamic compilation. Only the light-weight runtime information is re-compiled with $\sim$1 ms overhead, thus the performance is guaranteed for the private cloud. Our extensive experimental results show that the proposed virtualization design achieves 1.07-1.69x and 1.88-3.12x throughput improvement over previous static designs using the single-core and the multi-core architectures, respectively.