Rashmi Agrawal

CR
6papers
84citations
Novelty50%
AI Score27

6 Papers

ARJun 15, 2023
Leveraging Residue Number System for Designing High-Precision Analog Deep Neural Network Accelerators

Cansu Demirkiran, Rashmi Agrawal, Vijay Janapa Reddi et al.

Achieving high accuracy, while maintaining good energy efficiency, in analog DNN accelerators is challenging as high-precision data converters are expensive. In this paper, we overcome this challenge by using the residue number system (RNS) to compose high-precision operations from multiple low-precision operations. This enables us to eliminate the information loss caused by the limited precision of the ADCs. Our study shows that RNS can achieve 99% FP32 accuracy for state-of-the-art DNN inference using data converters with only $6$-bit precision. We propose using redundant RNS to achieve a fault-tolerant analog accelerator. In addition, we show that RNS can reduce the energy consumption of the data converters within an analog accelerator by several orders of magnitude compared to a regular fixed-point approach.

CRJul 3, 2020Code
Fast Arithmetic Hardware Library For RLWE-Based Homomorphic Encryption

Rashmi Agrawal, Lake Bu, Alan Ehret et al.

In this work, we propose an open-source, first-of-its-kind, arithmetic hardware library with a focus on accelerating the arithmetic operations involved in Ring Learning with Error (RLWE)-based somewhat homomorphic encryption (SHE). We design and implement a hardware accelerator consisting of submodules like Residue Number System (RNS), Chinese Remainder Theorem (CRT), NTT-based polynomial multiplication, modulo inverse, modulo reduction, and all the other polynomial and scalar operations involved in SHE. For all of these operations, wherever possible, we include a hardware-cost efficient serial and a fast parallel implementation in the library. A modular and parameterized design approach helps in easy customization and also provides flexibility to extend these operations for use in most homomorphic encryption applications that fit well into emerging FPGA-equipped cloud architectures. Using the submodules from the library, we prototype a hardware accelerator on FPGA. The evaluation of this hardware accelerator shows a speed up of approximately 4200x and 2950x to evaluate a homomorphic multiplication and addition respectively when compared to an existing software implementation.

CRDec 13, 2021
Does Fully Homomorphic Encryption Need Compute Acceleration?

Leo de Castro, Rashmi Agrawal, Rabia Yazicigil et al.

Fully Homomorphic Encryption (FHE) allows arbitrarily complex computations on encrypted data without ever needing to decrypt it, thus enabling us to maintain data privacy on third-party systems. Unfortunately, sustaining deep computations with FHE requires a periodic noise reduction step known as bootstrapping. The cost of the bootstrapping operation is one of the primary barriers to the wide-spread adoption of FHE. In this paper, we present an in-depth architectural analysis of the bootstrapping step in FHE. First, we observe that secure implementations of bootstrapping exhibit a low arithmetic intensity (<1 Op/byte), require large caches (>100 MB), and are heavily bound by the main memory bandwidth. Consequently, we demonstrate that existing workloads observe marginal performance gains from the design of bespoke high-throughput arithmetic units tailored to FHE. Second, we propose several cache-friendly algorithmic optimizations that improve the throughput in FHE bootstrapping by enabling up to 3.2x higher arithmetic intensity and 4.6x lower memory bandwidth. Our optimizations apply to a wide range of structurally similar computations such as private evaluation and training of machine learning models. Finally, we incorporate these optimizations into an architectural tool which, given a cache size, memory subsystem, the number of functional units and a desired security level, selects optimal cryptosystem parameters to maximize the bootstrapping throughput. Our optimized bootstrapping implementation represents a best-case scenario for compute acceleration of FHE. We show that despite these optimizations, bootstrapping continues to be bottlenecked by main memory bandwidth. We propose new research directions to address the underlying memory bottleneck. In summary, our answer to the titular question is: yes, but only after addressing the memory bottleneck!

CRDec 4, 2021
Efficient FPGA-based ECDSA Verification Engine for Permissioned Blockchains

Rashmi Agrawal, Ji Yang, Haris Javaid

As enterprises embrace blockchain technology, many real-world applications have been developed and deployed using permissioned blockchain platforms (access to network is controlled and given to only nodes with known identities). Such blockchain platforms heavily depend on cryptography to provide a layer of trust within the network, thus verification of cryptographic signatures often becomes the bottleneck. The Elliptic Curve Digital Signature Algorithm (ECDSA) is the most commonly used cryptographic scheme in permissioned blockchains. In this paper, we propose an efficient implementation of ECDSA signature verification on an FPGA, in order to improve the performance of permissioned blockchains that aim to use FPGA-based hardware accelerators. We propose several optimizations for modular arithmetic (e.g., custom multipliers and fast modular reduction) and point arithmetic (e.g., reduced number of point double and addition operations, and optimal width NAF representation). Based on these optimized modular and point arithmetic modules, we propose an ECDSA verification engine that can be used by any application for fast verification of ECDSA signatures. We further optimize our ECDSA verification engine for Hyperledger Fabric (one of the most widely used permissioned blockchain platforms) by moving carefully selected operations to a precomputation block, thus simplifying the critical path of ECDSA signature verification. From our implementation on Xilinx Alveo U250 accelerator board with target frequency of 250MHz, our ECDSA verification engine can perform a single verification in $760μs$ resulting in a throughput of 1,315 verifications per second, which is ~2.5x faster than state-of-the-art FPGA-based implementations. Our Hyperledger Fabric-specific ECDSA engine can perform a single verification in $368μs$ with a throughput of 2,717 verifications per second.

CRMar 9, 2019
Post-Quantum Cryptographic Hardware Primitives

Lake Bu, Rashmi Agrawal, Hai Cheng et al.

The development and implementation of post-quantum cryptosystems have become a pressing issue in the design of secure computing systems, as general quantum computers have become more feasible in the last two years. In this work, we introduce a set of hardware post-quantum cryptographic primitives (PCPs) consisting of four frequently used security components, i.e., public-key cryptosystem (PKC), key exchange (KEX), oblivious transfer (OT), and zero-knowledge proof (ZKP). In addition, we design a high speed polynomial multiplier to accelerate these primitives. These primitives will aid researchers and designers in constructing quantum-proof secure computing systems in the post-quantum era.

CRMar 9, 2019
A Lightweight McEliece Cryptosystem Co-processor Design

Lake Bu, Rashmi Agrawal, Hai Cheng et al.

Due to the rapid advances in the development of quantum computers and their susceptibility to errors, there is a renewed interest in error correction algorithms. In particular, error correcting code-based cryptosystems have reemerged as a highly desirable coding technique. This is due to the fact that most classical asymmetric cryptosystems will fail in the quantum computing era. Quantum computers can solve many of the integer factorization and discrete logarithm problems efficiently. However, code-based cryptosystems are still secure against quantum computers, since the decoding of linear codes remains as NP-hard even on these computing systems. One such cryptosystem is the McEliece code-based cryptosystem. The original McEliece code-based cryptosystem uses binary Goppa code, which is known for its good code rate and error correction capability. However, its key generation and decoding procedures have a high computation complexity. In this work we propose a design and hardware implementation of an public-key encryption and decryption co-processor based on a new variant of McEliece system. This co-processor takes the advantage of the non-binary Orthogonal Latin Square Codes to achieve much smaller computation complexity, hardware cost, and the key size.