20.5LGMay 3, 2022
Efficient Fine-Tuning of BERT Models on the EdgeDanilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard et al.
Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, comes increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
7.8ARMar 27
Efficient CMOS Invertible Logic Using Stochastic ComputingSean C. Smithson, Naoya Onizawa, Brett H. Meyer et al.
Invertible logic can operate in one of two modes: 1) a forward mode, in which inputs are presented and a single, correct output is produced, and 2) a reverse mode, in which the output is fixed and the inputs take on values consistent with the output. It is possible to create invertible logic using various Boltzmann machine configurations. Such systems have been shown to solve certain challenging problems quickly, such as factorization and combinatorial optimization. In this paper, we show that invertible logic can be implemented using simple spiking neural networks based on stochastic computing. We present a design methodology for invertible stochastic gates, which can be implemented using a small amount of CMOS hardware. We demonstrate that our design can not only correctly implement basic gates with invertible capability, but can also be extended to construct invertible stochastic adder and multiplier circuits. Experimental results are presented which demonstrate correct operation of synthesizable invertible circuitry performing both multiplication and factorization, along with fabricated ASIC measurement results for an invertible multiplier circuit.
0.6CLAug 3, 2022
Efficient Fine-Tuning of Compressed Language Models with LearnersDanilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard et al.
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and have significantly lower resource utilization.
2.6CVSep 15, 2022
CES-KD: Curriculum-based Expert Selection for Guided Knowledge DistillationIbtihel Amara, Maryam Ziaeefard, Brett H. Meyer et al.
Knowledge distillation (KD) is an effective tool for compressing deep classification models for edge devices. However, the performance of KD is affected by the large capacity gap between the teacher and student networks. Recent methods have resorted to a multiple teacher assistant (TA) setting for KD, which sequentially decreases the size of the teacher model to relatively bridge the size gap between these models. This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD) to efficiently enhance the learning of a compact student under the capacity gap problem. This technique is built upon the hypothesis that a student network should be guided gradually using stratified teaching curriculum as it learns easy (hard) data samples better and faster from a lower (higher) capacity teacher network. Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image. In this work, we empirically verify our hypothesis and rigorously experiment with CIFAR-10, CIFAR-100, CINIC-10, and ImageNet datasets and show improved accuracy on VGG-like models, ResNets, and WideResNets architectures.
6.5CVDec 25, 2022
BD-KD: Balancing the Divergences for Online Knowledge DistillationIbtihel Amara, Nazanin Sepahvand, Brett H. Meyer et al.
We address the challenge of producing trustworthy and accurate compact models for edge devices. While Knowledge Distillation (KD) has improved model compression in terms of achieving high accuracy performance, calibration of these compact models has been overlooked. We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD. BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques, which add computational overhead to the overall training pipeline and degrade performance. Our method encourages student-centered training by adjusting the conventional online distillation loss on both the student and teacher losses, employing sample-wise weighting of forward and reverse Kullback-Leibler divergence. This strategy balances student network confidence and boosts performance. Experiments across CIFAR10, CIFAR100, TinyImageNet, and ImageNet datasets, and various architectures demonstrate improved calibration and accuracy compared to recent online KD methods.
1.5CVApr 21, 2023
SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic SegmentationOlivier Therrien, Marihan Amein, Zhuoran Xiong et al.
We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. It uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form 2.88 * 10^17 possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.
1.5CVMar 28, 2023
FMAS: Fast Multi-Objective SuperNet Architecture Search for Semantic SegmentationZhuoran Xiong, Marihan Amein, Olivier Therrien et al.
We present FMAS, a fast multi-objective neural architecture search framework for semantic segmentation. FMAS subsamples the structure and pre-trained parameters of DeepLabV3+, without fine-tuning, dramatically reducing training time during search. To further reduce candidate evaluation time, we use a subset of the validation dataset during the search. Only the final, Pareto non-dominated, candidates are ultimately fine-tuned using the complete training set. We evaluate FMAS by searching for models that effectively trade accuracy and computational cost on the PASCAL VOC 2012 dataset. FMAS finds competitive designs quickly, e.g., taking just 0.5 GPU days to discover a DeepLabV3+ variant that reduces FLOPs and parameters by 10$\%$ and 20$\%$ respectively, for less than 3$\%$ increased error. We also search on an edge device called GAP8 and use its latency as the metric. FMAS is capable of finding 2.2$\times$ faster network with 7.61$\%$ MIoU loss.
Automatic Pruning of Fine-tuning Datasets for Transformer-based Language ModelsMohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer et al.
Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average $3 \times$ smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 language models show that, on average, fine-tuning on the winning ticket subsets results in a $0.1 \%$ increase in the evaluation performance of the model.
5.8LGFeb 26
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language ModelsSayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that are focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that group over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to $22\%$ speedup over previous work and up to $88\%$ over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
3.7CVFeb 2, 2024
Faster Inference of Integer SWIN Transformer by Removing the GELU ActivationMohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark et al.
SWIN transformer is a prominent vision transformer model that has state-of-the-art accuracy in image classification tasks. Despite this success, its unique architecture causes slower inference compared with similar deep neural networks. Integer quantization of the model is one of the methods used to improve its inference latency. However, state-of-the-art has not been able to fully quantize the model. In this work, we improve upon the inference latency of the state-of-the-art methods by removing the floating-point operations, which are associated with the GELU activation in Swin Transformer. While previous work proposed to replace the non-integer operations with linear approximation functions, we propose to replace GELU with ReLU activation. The advantage of ReLU over previous methods is its low memory and computation complexity. We use iterative knowledge distillation to compensate for the lost accuracy due to replacing GELU with ReLU. We quantize our GELU-less SWIN transformer and show that on an RTX 4090 NVIDIA GPU we can improve the inference latency of the quantized SWIN transformer by at least $11\%$ while maintaining an accuracy drop of under $0.5\%$ on the ImageNet evaluation dataset.
2.0CVJan 24, 2024
AdCorDA: Classifier Refinement via Adversarial Correction and Domain AdaptationLulan Shen, Ali Edalati, Brett Meyer et al.
This paper describes a simple yet effective technique for refining a pretrained classifier network. The proposed AdCorDA method is based on modification of the training set and making use of the duality between network weights and layer inputs. We call this input space training. The method consists of two stages - adversarial correction followed by domain adaptation. Adversarial correction uses adversarial attacks to correct incorrect training-set classifications. The incorrectly classified samples of the training set are removed and replaced with the adversarially corrected samples to form a new training set, and then, in the second stage, domain adaptation is performed back to the original training set. Extensive experimental validations show significant accuracy boosts of over 5% on the CIFAR-100 dataset. The technique can be straightforwardly applied to refinement of weight-quantized neural networks, where experiments show substantial enhancement in performance over the baseline. The adversarial correction technique also results in enhanced robustness to adversarial attacks.
2.6LGJan 22, 2024
Robustness to distribution shifts of compressed networks for edge devicesLulan Shen, Ali Edalati, Brett Meyer et al.
It is necessary to develop efficient DNNs deployed on edge devices with limited computation resources. However, the compressed networks often execute new tasks in the target domain, which is different from the source domain where the original network is trained. It is important to investigate the robustness of compressed networks in two types of data distribution shifts: domain shifts and adversarial perturbations. In this study, we discover that compressed models are less robust to distribution shifts than their original networks. Interestingly, larger networks are more vulnerable to losing robustness than smaller ones, even when they are compressed to a similar size as the smaller networks. Furthermore, compact networks obtained by knowledge distillation are much more robust to distribution shifts than pruned networks. Finally, post-training quantization is a reliable method for achieving significant robustness to distribution shifts, and it outperforms both pruned and distilled models in terms of robustness.
1.8LGFeb 24, 2022
Standard Deviation-Based Quantization for Deep Neural NetworksAmir Ardakani, Arash Ardakani, Brett Meyer et al.
Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.
4.3ITSep 15, 2020
Decoding Polar Codes with Reinforcement LearningNghia Doan, Seyyed Ali Hashemi, Warren Gross
In this paper we address the problem of selecting factor-graph permutations of polar codes under belief propagation (BP) decoding to significantly improve the error-correction performance of the code. In particular, we formalize the factor-graph permutation selection as the multi-armed bandit problem in reinforcement learning and propose a decoder that acts like an online-learning agent that learns to select the good factor-graph permutations during the course of decoding. We use state-of-the-art algorithms for the multi-armed bandit problem and show that for a 5G polar codes of length 128 with 64 information bits, the proposed decoder has an error-correction performance gain of around 0.125 dB at the target frame error rate of 10^{-4}, when compared to the approach that randomly selects the factor-graph permutations.
2.9LGNov 9, 2018
Learning to Skip Ineffectual Recurrent Computations in LSTMsArash Ardakani, Zhengyun Ji, Warren J. Gross
Long Short-Term Memory (LSTM) is a special class of recurrent neural network, which has shown remarkable successes in processing sequential data. The typical architecture of an LSTM involves a set of states and gates: the states retain information over arbitrary time intervals and the gates regulate the flow of information. Due to the recursive nature of LSTMs, they are computationally intensive to deploy on edge devices with limited hardware resources. To reduce the computational complexity of LSTMs, we first introduce a method that learns to retain only the important information in the states by pruning redundant information. We then show that our method can prune over 90% of information in the states without incurring any accuracy degradation over a set of temporal tasks. This observation suggests that a large fraction of the recurrent computations are ineffectual and can be avoided to speed up the process during the inference as they involve noncontributory multiplications/accumulations with zero-valued states. Finally, we introduce a custom hardware accelerator that can perform the recurrent computations using both sparse and dense states. Experimental measurements show that performing the computations using the sparse states speeds up the process and improves energy efficiency by up to 5.2x when compared to implementation results of the accelerator performing the computations using dense states.
11.7ITOct 23, 2018
Learning from the SyndromeLoren Lugosch, Warren J. Gross
In this paper, we introduce the syndrome loss, an alternative loss function for neural error-correcting decoders based on a relaxation of the syndrome. The syndrome loss penalizes the decoder for producing outputs that do not correspond to valid codewords. We show that training with the syndrome loss yields decoders with consistently lower frame error rate for a number of short block codes, at little additional cost during training and no additional cost during inference. The proposed method does not depend on knowledge of the transmitted codeword, making it a promising tool for online adaptation to changing channel conditions.
Learning Recurrent Binary/Ternary WeightsArash Ardakani, Zhengyun Ji, Sean C. Smithson et al.
Recurrent neural networks (RNNs) have shown excellent performance in processing sequence data. However, they are both complex and memory intensive due to their recursive nature. These limitations make RNNs difficult to embed on mobile devices requiring real-time processes with limited hardware resources. To address the above issues, we introduce a method that can learn binary and ternary weights during the training phase to facilitate hardware implementations of RNNs. As a result, using this approach replaces all multiply-accumulate operations by simple accumulations, bringing significant benefits to custom hardware in terms of silicon area and power consumption. On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. On the hardware side, we present custom hardware for accelerating the recurrent computations of LSTMs with binary/ternary weights. Ultimately, we show that LSTMs with binary/ternary weights can achieve up to 12x memory saving and 10x inference speedup compared to the full-precision implementation on an ASIC platform.
29.5ITJun 21, 2017
Deep Learning Methods for Improved Decoding of Linear CodesEliya Nachmani, Elad Marciano, Loren Lugosch et al.
The problem of low complexity, close to optimal, channel decoding of linear codes with short to moderate block length is considered. It is shown that deep learning methods can be used to improve a standard belief propagation decoder, despite the large example space. Similar improvements are obtained for the min-sum algorithm. It is also shown that tying the parameters of the decoders across iterations, so as to form a recurrent neural network architecture, can be implemented with comparable results. The advantage is that significantly less parameters are required. We also introduce a recurrent neural decoder architecture based on the method of successive relaxation. Improvements over standard belief propagation are also observed on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the neural belief propagation decoder can be used to improve the performance, or alternatively reduce the computational complexity, of a close to optimal decoder of short BCH codes.
14.5ITJan 20, 2017
Neural Offset Min-Sum DecodingLoren Lugosch, Warren J. Gross
Recently, it was shown that if multiplicative weights are assigned to the edges of a Tanner graph used in belief propagation decoding, it is possible to use deep learning techniques to find values for the weights which improve the error-correction performance of the decoder. Unfortunately, this approach requires many multiplications, which are generally expensive operations. In this paper, we suggest a more hardware-friendly approach in which offset min-sum decoding is augmented with learnable offset parameters. Our method uses no multiplications and has a parameter count less than half that of the multiplicative algorithm. This both speeds up training and provides a feasible path to hardware architectures. After describing our method, we compare the performance of the two neural decoding algorithms and show that our method achieves error-correction performance within 0.1 dB of the multiplicative approach and as much as 1 dB better than traditional belief propagation for the codes under consideration.
15.8NENov 7, 2016
Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter OptimizationSean C. Smithson, Guang Yang, Warren J. Gross et al.
Artificial neural networks have gone through a recent rise in popularity, achieving state-of-the-art results in various fields, including image classification, speech recognition, and automated control. Both the performance and computational complexity of such models are heavily dependant on the design of characteristic hyper-parameters (e.g., number of hidden layers, nodes per layer, or choice of activation functions), which have traditionally been optimized manually. With machine learning penetrating low-power mobile and embedded areas, the need to optimize not only for performance (accuracy), but also for implementation complexity, becomes paramount. In this work, we present a multi-objective design space exploration method that reduces the number of solution networks trained and evaluated through response surface modelling. Given spaces which can easily exceed 1020 solutions, manually designing a near-optimal architecture is unlikely as opportunities to reduce network complexity, while maintaining performance, may be overlooked. This problem is exacerbated by the fact that hyper-parameters which perform well on specific datasets may yield sub-par results on others, and must therefore be designed on a per-application basis. In our work, machine learning is leveraged by training an artificial neural network to predict the performance of future candidate networks. The method is evaluated on the MNIST and CIFAR-10 image datasets, optimizing for both recognition accuracy and computational complexity. Experimental results demonstrate that the proposed method can closely approximate the Pareto-optimal front, while only exploring a small fraction of the design space.
14.9NENov 4, 2016
Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural NetworksArash Ardakani, Carlo Condo, Warren J. Gross
Recently deep neural networks have received considerable attention due to their ability to extract and represent high-level abstractions in data sets. Deep neural networks such as fully-connected and convolutional neural networks have shown excellent performance on a wide range of recognition and classification tasks. However, their hardware implementations currently suffer from large silicon area and high power consumption due to the their high degree of complexity. The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks. In fact, they contain most of the deep neural network parameters. In this paper, we propose sparsely-connected networks, by showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving the accuracy performance on three popular datasets (MNIST, CIFAR10 and SVHN). We then propose an efficient hardware architecture based on linear-feedback shift registers to reduce the memory requirements of the proposed sparsely-connected networks. The proposed architecture can save up to 90% of memory compared to the conventional implementations of fully-connected neural networks. Moreover, implementation results show up to 84% reduction in the energy consumption of a single neuron of the proposed sparsely-connected networks compared to a single neuron of fully-connected neural networks.
16.7NESep 29, 2015
VLSI Implementation of Deep Neural Network Using Integral Stochastic ComputingArash Ardakani, François Leduc-Primeau, Naoya Onizawa et al.
The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing has shown promising results for low-power area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62% average reductions in area and latency compared to the best reported architecture in literature. We also synthesize the circuits in a 65 nm CMOS technology and we show that the proposed integral stochastic architecture results in up to 21% reduction in energy consumption compared to the binary radix implementation at the same misclassification rate. Due to fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation which yields 33% reduction in energy consumption w.r.t. the binary radix implementation without any compromise on performance.