Haris Javaid

2 Papers

AR · Sep 9, 2022
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference

Jing Gong, Hassaan Saadat, Hasindu Gamaarachchi et al.

Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power that training requires. Hardware approximate multipliers have proven effective at improving resource efficiency in DNN inference accelerators, but training with approximate multipliers is largely unexplored. To build resource-efficient accelerators with approximate multipliers that support DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library, in order to overcome the absence of native hardware approximate multipliers in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNet and ResNet architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. The original TensorFlow, which relies on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, is only 8x faster than ApproxTrain.
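
The abstract describes AMSim only as a LUT-based simulator for approximate FP multipliers, so the following is a rough sketch of the general idea and not the framework's actual implementation. It assumes a 4-bit mantissa index (`MANT_BITS`), a hypothetical functional model (`approx_mantissa_product`, a Mitchell-style approximation that drops the cross term), and plain Python in place of the C/C++ and CUDA code the paper mentions. All function names are illustrative.

```python
import math

MANT_BITS = 4  # assumption: mantissa bits used to index the table


def approx_mantissa_product(fa, fb):
    """Hypothetical functional model: (1+fa)(1+fb) ~ 1 + fa + fb (cross term dropped)."""
    return 1.0 + fa + fb


def build_lut(model, bits=MANT_BITS):
    """Precompute the model's output for every pair of truncated mantissa fractions."""
    n = 1 << bits
    return [[model(i / n, j / n) for j in range(n)] for i in range(n)]


def split(x, bits=MANT_BITS):
    """Write |x| as (1 + f) * 2**e and return (truncated index of f, e)."""
    m, e = math.frexp(abs(x))    # |x| = m * 2**e with 0.5 <= m < 1
    f = 2.0 * m - 1.0            # |x| = (1 + f) * 2**(e - 1) with 0 <= f < 1
    return int(f * (1 << bits)), e - 1


def lut_multiply(a, b, lut, bits=MANT_BITS):
    """Simulated multiply: XOR the signs, add the exponents, look up the mantissa product."""
    if a == 0.0 or b == 0.0:
        return 0.0               # zeros bypass the table; inf/NaN are not handled here
    ia, ea = split(a, bits)
    ib, eb = split(b, bits)
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * math.ldexp(lut[ia][ib], ea + eb)
```

With the hypothetical model above, `lut_multiply(1.5, 1.5, build_lut(approx_mantissa_product))` returns 2.0 where exact multiplication gives 2.25, which is the kind of error a training run would then have to tolerate. The table is built once per multiplier model, so the per-multiplication cost no longer depends on how expensive the functional model is to evaluate.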

CR · Dec 4, 2021
Efficient FPGA-based ECDSA Verification Engine for Permissioned Blockchains

Rashmi Agrawal, Ji Yang, Haris Javaid

As enterprises embrace blockchain technology, many real-world applications have been developed and deployed using permissioned blockchain platforms (where network access is controlled and granted only to nodes with known identities). Such blockchain platforms depend heavily on cryptography to provide a layer of trust within the network, so verification of cryptographic signatures often becomes the bottleneck. The Elliptic Curve Digital Signature Algorithm (ECDSA) is the most commonly used cryptographic scheme in permissioned blockchains. In this paper, we propose an efficient implementation of ECDSA signature verification on an FPGA, in order to improve the performance of permissioned blockchains that aim to use FPGA-based hardware accelerators. We propose several optimizations for modular arithmetic (e.g., custom multipliers and fast modular reduction) and point arithmetic (e.g., reduced number of point double and addition operations, and optimal width NAF representation). Based on these optimized modular and point arithmetic modules, we propose an ECDSA verification engine that can be used by any application for fast verification of ECDSA signatures. We further optimize our ECDSA verification engine for Hyperledger Fabric (one of the most widely used permissioned blockchain platforms) by moving carefully selected operations to a precomputation block, thus simplifying the critical path of ECDSA signature verification. In our implementation on a Xilinx Alveo U250 accelerator board with a target frequency of 250 MHz, our ECDSA verification engine performs a single verification in 760 μs, resulting in a throughput of 1,315 verifications per second, which is ~2.5x faster than state-of-the-art FPGA-based implementations. Our Hyperledger Fabric-specific ECDSA engine performs a single verification in 368 μs, for a throughput of 2,717 verifications per second.
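
The abstract mentions a width NAF (non-adjacent form) representation as one of the point-arithmetic optimizations but does not give the width chosen or the hardware design. As a minimal sketch of the textbook width-w NAF recoding, written in Python for illustration only (the function names `wnaf` and `from_digits` are hypothetical and this is not the authors' engine):

```python
def wnaf(k, w):
    """Return the width-w NAF digits of a positive integer k, least significant first.

    Every nonzero digit is odd with absolute value below 2**(w - 1), and any
    w consecutive digits contain at most one nonzero entry.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)           # low w bits of k
            if d >= 1 << (w - 1):
                d -= 1 << w            # map to the signed residue
            k -= d                     # clears the low w bits of k
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def from_digits(digits):
    """Recombine signed digits into the integer they represent."""
    return sum(d << i for i, d in enumerate(digits))
```

In a double-and-add scalar multiplication, each nonzero digit costs one point addition (or subtraction) and each digit position costs one doubling, so fewer nonzero digits means fewer additions. For example, 7 has three nonzero bits in binary, while `wnaf(7, 2)` yields `[-1, 0, 0, 1]` (8 minus 1) with only two nonzero digits.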