LGOct 8, 2023Code
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?Cheng Zhang, Jianyi Cheng, Ilia Shumailov et al. · deepmind
The inference of Large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has merged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and $5\times$ memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
SYMar 5, 2013
Embedded Online Optimization for Model Predictive Control at Megahertz RatesJuan L. Jerez, Paul J. Goulart, Stefan Richter et al.
Faster, cheaper, and more power efficient optimization solvers than those currently offered by general-purpose solutions are required for extending the use of model predictive control (MPC) to resource-constrained embedded platforms. We propose several custom computational architectures for different first-order optimization methods that can handle linear-quadratic MPC problems with input, input-rate, and soft state constraints. We provide analysis ensuring the reliable operation of the resulting controller under reduced precision fixed-point arithmetic. Implementation of the proposed architectures in FPGAs shows that satisfactory control performance at a sample rate beyond 1 MHz is achievable even on low-end devices, opening up new possibilities for the application of MPC on embedded systems.
ARJul 1, 2024Code
Exploring FPGA designs for MX and beyondEbby Samson, Naveen Mellempudi, Wayne Luk et al.
A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard's concrete formats for conversion into and out of MX formats and for the standard-defined arithmetic operations, as well as arbitrary fixed-point and floating-point formats. Certain elements of the standard are left as implementation-defined, and we present the first concrete FPGA-inspired choices for these elements, which we outline in the paper. Our library of optimized hardware components is available open source, and can be used to build larger systems. For this purpose, we also describe and release an open-source Pytorch library for quantization into the new standard, integrated with the Brevitas library so that the community can develop novel neural network designs quantized with MX formats in mind. We demonstrate the usability and efficacy of our libraries via the implementation of example neural networks such as ResNet-18 on the ImageNet ILSVRC12 dataset. Our testing shows that MX is very effective for formats such as INT5 or FP6 which are not natively supported on GPUs. This gives FPGAs an advantage as they have the flexibility to implement a custom datapath and take advantage of the smaller area footprints offered by these formats.
LGSep 5, 2023
PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based InferenceMarta Andronic, George A. Constantinides
Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
SYOct 24, 2017
Nonlinear Predictive Control on a Heterogeneous Computing PlatformBulat Khusainov, Eric C. Kerrigan, Andrea Suardi et al.
We propose an implementation of an interior-point-based nonlinear predictive controller on a heterogeneous processor. The workload can be split between a general-purpose CPU and a field-programmable gate array to trade off the contradicting design objectives of control performance and computational resource usage. A new way of exploiting the structure of the KKT matrix yields significant memory savings. We report an 18x memory saving, compared to existing approaches, and a 36x speedup over a software implementation with an ARM Cortex-A9 processor. We also introduce a new release of Protoip, which abstracts low-level details of heterogeneous programming and allows processor-in-the-loop verification.
SYOct 24, 2017
Automatic Software and Computing Hardware Co-design for Predictive ControlBulat Khusainov, Eric C. Kerrigan, George A. Constantinides
Model Predictive Control (MPC) is a computationally demanding control technique that allows dealing with multiple-input and multiple-output systems, while handling constraints in a systematic way. The necessity of solving an optimization problem at every sampling instant often (i) limits the application scope to slow dynamical systems and/or (ii) results in expensive computational hardware implementations. Traditional MPC design is based on manual tuning of software and computational hardware design parameters, which leads to suboptimal implementations. This paper proposes a framework for automating the MPC software and computational hardware co-design, while achieving the optimal trade-off between computational resource usage and controller performance. The proposed approach is based on using a multi-objective optimization algorithm, namely BiMADS. Two test studies are considered: Central Processing Unit (CPU) and Field-Programmable Gate Array (FPGA) implementations of fast gradient-based MPC. Numerical experiments show that optimization-based design outperforms Latin Hypercube Sampling (LHS), a statistical sampling-based design exploration technique.
ARAug 9, 2023
FPGA Resource-aware Structured Pruning for Real-Time Neural NetworksBenjamin Ramhorst, Vladimir Loncar, George A. Constantinides
Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multiplications and memory. However, pruning often fails to capture properties of the underlying hardware, causing unstructured sparsity and load-balance inefficiency, thus bottlenecking resource improvements. We propose a hardware-centric formulation of pruning, by formulating it as a knapsack problem with resource-aware tensor structures. Evaluated on a range of tasks, including sub-microsecond particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method achieves reductions ranging between 55% and 92% in the DSP utilization and up to 81% in BRAM utilization.
ARApr 17, 2023
ATHEENA: A Toolflow for Hardware Early-Exit Network AutomationBenjamin Biggs, Christos-Savvas Bouganis, George A. Constantinides
The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose to shift the focus to using the varying difficulty of individual data samples to further improve efficiency and reduce average compute for classification. Input-dependent computation allows for the network to make runtime decisions to finish a task early if the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software. We create: A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. The toolflow uses the data-flow model of fpgaConvNet, extended to support Early-Exit networks as well as Design Space Exploration to optimize the generated streaming architecture hardware with the goal of increasing throughput/reducing area while maintaining accuracy. Experimental results on three different networks demonstrate a throughput increase of $2.00\times$ to $2.78\times$ compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can achieve a throughput matching the same baseline with as low as $46\%$ of the resources the baseline requires.
LOMar 17, 2022
Abstract Interpretation on E-GraphsSamuel Coward, George A. Constantinides, Theo Drane
Recent e-graph applications have typically considered concrete semantics of expressions, where the notion of equivalence stems from concrete interpretation of expressions. However, equivalences that hold over one interpretation may not hold in an alternative interpretation. Such an observation can be exploited. We consider the application of abstract interpretation to e-graphs, and show that within an e-graph, the lattice meet operation associated with the abstract domain has a natural interpretation for an e-class, leading to improved precision in over-approximation. In this extended abstract, we use Interval Arithmetic (IA) to illustrate this point.
OCFeb 6, 2019
Bounding Computational Complexity under Cost Function Scaling in Predictive ControlIan McInerney, Eric C. Kerrigan, George A. Constantinides
We present a framework for upper bounding the number of iterations required by first-order optimization algorithms implementing constrained LQR controllers. We derive new bounds for the condition number and extremal eigenvalues of the primal and dual Hessian matrices when the cost function is scaled. These bounds are horizon-independent, allowing for their use with receding, variable and decreasing horizon controllers. We considerably relax prior assumptions on the structure of the weight matrices and assume only that the system is Schur-stable and the primal Hessian of the quadratic program (QP) is positive-definite. Our analysis uses the Toeplitz structure of the QP matrices to relate their spectrum to the transfer function of the system, allowing for the use of system-theoretic techniques to compute the bounds. Using these bounds, we can compute the effect on the computational complexity of trading off the input energy used against the state deviation. An example system shows a three-times increase in algorithm iterations between the two extremes, with the state 2-norm decreased by only 5% despite a greatly increased state deviation penalty.
66.5LGMay 8Code
Direction-Preserving Number RepresentationsBardia Zadeh, George A. Constantinides
Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two's complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA's NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA's choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.
LGFeb 4, 2024Code
LQER: Low-Rank Quantization Error Reconstruction for LLMsCheng Zhang, Jianyi Cheng, George A. Constantinides et al.
Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly-lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-base iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer
LGNov 2, 2025
Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic RoundingTaowen Liu, Marta Andronic, Deniz Gündüz et al.
LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors -- especially batch size -- remains under explored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
LGJun 21, 2024Code
Unlocking the Global Synergies in Low-Rank AdaptersZixi Zhang, Cheng Zhang, Xitong Gao et al.
Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.
LGFeb 8, 2021Code
Enabling Binary Neural Network Training on the EdgeErwei Wang, James J. Davis, Daniele Moro et al.
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss vs Courbariaux & Bengio's standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3--5$\times$, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78$\times$ memory reduction. Our work is open-source, and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy.
LGNov 18, 2024
BitMoD: Bit-serial Mixture-of-Datatype LLM AccelerationYuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai et al.
Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization scheme. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.
ARFeb 29, 2024
NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable FunctionsMarta Andronic, George A. Constantinides
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, but with rigid sparsity and quantization enforced between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific FPGA hardware-inspired sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to up to $4.3\times$ lower latency NNs for the same accuracy.
ARDec 24, 2024
ReducedLUT: Table Decomposition with "Don't Care" ConditionsOliver Cassidy, Marta Andronic, Samuel Coward et al.
Lookup tables (LUTs) are frequently used to efficiently store arrays of precomputed values for complex mathematical computations. When used in the context of neural networks, these functions exhibit a lack of recognizable patterns which presents an unusual challenge for conventional logic synthesis techniques. Several approaches are known to break down a single large lookup table into multiple smaller ones that can be recombined. Traditional methods, such as plain tabulation, piecewise linear approximation, and multipartite table methods, often yield inefficient hardware solutions when applied to LUT-based NNs. This paper introduces ReducedLUT, a novel method to reduce the footprint of the LUTs by injecting don't cares into the compression process. This additional freedom introduces more self-similarities which can be exploited using known decomposition techniques. We then demonstrate a particular application to machine learning; by replacing unobserved patterns within the training data of neural network models with don't cares, we enable greater compression with minimal model accuracy degradation. In practice, we achieve up to $1.63\times$ reduction in Physical LUT utilization, with a test accuracy drop of no more than $0.01$ accuracy points.
LGJan 14, 2025
PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured PruningMarta Andronic, Jiawen Li, George A. Constantinides
Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded these operations inside FPGA lookup tables (LUTs). However, FPGA LUTs can implement a much greater variety of functions. In this paper, we propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. By using polynomial building blocks, we achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. LUT-based implementations also face a significant challenge: the LUT size grows exponentially with the number of inputs. Prior work relies on a priori fixed sparsity, with results heavily dependent on seed selection. To address this, we propose a structured pruning strategy using a bespoke hardware-aware group regularizer that encourages a particular sparsity pattern that leads to a small number of inputs per neuron. We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and MNIST.
LGApr 1, 2025
NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT InferenceMarta Andronic, George A. Constantinides
Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications, particularly when deployed on field-programmable gate arrays (FPGAs) for edge computing. These architectures promise ultra-low latency and reduced resource utilization, broadening neural network adoption in fields such as particle physics. However, existing LUT-based designs suffer from accuracy degradation due to the large fan-in required by neurons being limited by the exponential scaling of LUT resources with input width. In practice, in prior work this tension has resulted in the reliance on extremely sparse models. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. Additionally, we introduce skip-connections across entire LUT structures to improve gradient flow. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models, achieving competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification, demonstrating up to $8.42\times$ reduction in the area-delay product compared to the state-of-the-art at the time of the publication.
CLMay 19, 2025
A3 : an Analytical Low-Rank Approximation Framework for AttentionJeffrey T. H. Wong, Cheng Zhang, Xinye Cao et al.
Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\tt A^\tt 3$, a post-training low-rank approximation framework. $\tt A^\tt 3$ splits a Transformer layer into three functional components, namely $\tt QK$, $\tt OV$, and $\tt MLP$. For each component, $\tt A^\tt 3$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss ($\it i.e.$, error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\tt A^\tt 3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\tt A^\tt 3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
LGSep 7, 2025
An Improved Template for Approximate ComputingMorteza Rezaalipour, Francesco Costa, Marco Biasion et al.
Deploying neural networks on edge devices entails a careful balance between the energy required for inference and the accuracy of the resulting classification. One technique for navigating this tradeoff is approximate computing: the process of reducing energy consumption by slightly reducing the accuracy of arithmetic operators. In this context, we propose a methodology to reduce the area of the small arithmetic operators used in neural networks - i.e., adders and multipliers - via a small loss in accuracy, and show that we improve area savings for the same accuracy loss w.r.t. the state of the art. To achieve our goal, we improve on a boolean rewriting technique recently proposed, called XPAT, where the use of a parametrisable template to rewrite circuits has proved to be highly beneficial. In particular, XPAT was able to produce smaller circuits than comparable approaches while utilising a naive sum of products template structure. In this work, we show that template parameters can act as proxies for chosen metrics and we propose a novel template based on parametrisable product sharing that acts as a close proxy to synthesised area. We demonstrate experimentally that our methodology converges better to low-area solutions and that it can find better approximations than both the original XPAT and two other state-of-the-art approaches.
LGJun 21, 2024
Optimised Grouped-Query Attention Mechanism for TransformersYuang Chen, Cheng Zhang, Xitong Gao et al.
Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
SEJan 27, 2022
High-level Synthesis using the Julia LanguageBenjamin Biggs, Ian McInerney, Eric C. Kerrigan et al.
The growing proliferation of FPGAs and High-level Synthesis (HLS) tools has led to a large interest in designing hardware accelerators for complex operations and algorithms. However, existing HLS toolflows typically require a significant amount of user knowledge or training to be effective in both industrial and research applications. In this paper, we propose using the Julia language as the basis for an HLS tool. The Julia HLS tool aims to decrease the barrier to entry for hardware acceleration by taking advantage of the readability of the Julia language and by allowing the use of the existing large library of standard mathematical functions written in Julia. We present a prototype Julia HLS tool, written in Julia, that transforms Julia code to VHDL. We highlight how features of Julia and its compiler simplified the creation of this tool, and we discuss potential directions for future work.
LGDec 4, 2021
Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network InferenceErwei Wang, James J. Davis, Georgios-Ilias Stavrou et al.
FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54x and 1.31x, respectively, while matching its accuracy. This implementation also reaches 2.71x the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67x vs LUTNet, allowing for implementation that was previously impossible on today's largest FPGAs.
LGOct 24, 2019
LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network InferenceErwei Wang, James J. Davis, Peter Y. K. Cheung et al.
Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter facilitating smaller, less power-hungry deployment over the former while sacrificing area and energy efficiency along with throughput. For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
LGMay 7, 2019
Rethinking Arithmetic for Deep Neural NetworksGeorge A. Constantinides
We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understanding of practical neural networks as discrete functions, and show that so-called binarised neural networks are functionally complete. In general, our results suggest that it is valuable to consider Boolean circuits as neural networks, leading to the question of which circuit topologies are promising. We argue that continuity is central to generalisation in learning, explore the interaction between data coding, network topology, and node functionality for continuity, and pose some open questions for future research. As a first step to bridging the gap between continuous and Boolean views of neural network accelerators, we present some recent results from our work on LUTNet, a novel Field-Programmable Gate Array inference approach. Finally, we conclude with additional possible fruitful avenues for research bridging the continuous and discrete views of neural networks.
LGApr 1, 2019
LUTNet: Rethinking Inference in FPGA Soft LogicErwei Wang, James J. Davis, Peter Y. K. Cheung et al.
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
CVJan 21, 2019
Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're GoingErwei Wang, James J. Davis, Ruizhe Zhao et al.
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.