Anantha P. Chandrakasan

CR
11 papers
429 citations
Novelty 57%
AI Score 52

11 Papers

CV Apr 29, 2022
Hardware Trojan Detection Using Unsupervised Deep Learning on Quantum Diamond Microscope Magnetic Field Images

Maitreyi Ashok, Matthew J. Turner, Ronald L. Walsworth et al.

This paper presents a method for hardware trojan detection in integrated circuits. Unsupervised deep learning is used to classify wide field-of-view (4x4 mm$^2$), high spatial resolution magnetic field images taken using a Quantum Diamond Microscope (QDM). QDM magnetic imaging is enhanced using quantum control techniques and improved diamond material to increase magnetic field sensitivity by a factor of 4 and measurement speed by a factor of 16 over previous demonstrations. These upgrades facilitate the first demonstration of QDM magnetic field measurement for hardware trojan detection. Unsupervised convolutional neural networks and clustering are used to infer trojan presence from unlabeled data sets of 600x600 pixel magnetic field images without human bias. This analysis is shown to be more accurate than principal component analysis for distinguishing between field programmable gate arrays configured with trojan-free and trojan-inserted logic. This framework is tested on a set of scalable trojans that we developed and measured with the QDM. Scalable and TrustHub trojans are detectable down to a minimum trojan trigger size of 0.5% of the total logic. The trojan detection framework can be used for golden-chip-free detection, since knowledge of the chips' identities is only used to evaluate detection accuracy.

80.5 CL Mar 30 Code
Adaptive Block-Scaled Data Types

Jack Cook, Hyemin S. Lee, Kathryn Le et al.

NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
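
The per-group format selection described in this abstract can be illustrated with a minimal Python sketch. Several details here are assumptions, not taken from the paper: the selection rule (lowest squared error), the symmetric INT4 range of -7..7, the polarity of the sign flag (negative scale means INT4), and the use of a full-precision scale in place of a true E4M3 scale. All function names are hypothetical.

```python
# Toy model of choosing FP4 or INT4 for one group of 16 values.
# Assumptions (not from the paper): squared-error selection, INT4 range -7..7,
# negative scale flags INT4, and the scale is NOT rounded to E4M3.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
INT4_GRID = [float(i) for i in range(8)]             # assumed INT4 magnitudes 0..7


def quantize_to_grid(values, grid):
    """Scale so the largest magnitude maps to the grid maximum, then round to nearest."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    scale = amax / grid[-1]
    dequantized = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append((mag if v >= 0 else -mag) * scale)
    return scale, dequantized


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_if4_group(values):
    """Return (signed_scale, dequantized_values) for one group."""
    fp_scale, fp_deq = quantize_to_grid(values, FP4_GRID)
    int_scale, int_deq = quantize_to_grid(values, INT4_GRID)
    if squared_error(values, int_deq) < squared_error(values, fp_deq):
        return -int_scale, int_deq  # sign bit set: group stored as INT4 (assumed polarity)
    return fp_scale, fp_deq         # sign bit clear: group stored as FP4
```

In this toy model, evenly spaced inputs tend to favor the uniform INT4 grid, while inputs concentrated near zero with a few large values tend to favor the non-uniform FP4 grid.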

50.7 AR Apr 22
EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads

Kyungmi Lee, Zhiye Song, Eun Kyung Lee et al.

As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques themselves, but in obtaining the hardware utilization inputs they require. Conventional approaches rely on either costly simulation or hardware profiling, which makes them impractical when rapid predictions are required. This work presents EnergAIzer, which addresses this scalability bottleneck by developing a lightweight solution to predict utilization inputs, reducing the estimation wall-clock time from hours to seconds. Our key insight is that kernels in AI workloads commonly employ optimizations that create structured patterns, which analytically determine memory traffic and execution timeline. We construct a performance model using these patterns as an analytical scaffold for empirical data fitting, which also naturally exposes module-level utilization. This predicted utilization is then fed into our power model to estimate dynamic power consumption. EnergAIzer achieves 8% power error on NVIDIA Ampere GPUs, competitive with traditional power models with elaborate cycle-level simulation or hardware profiling. We demonstrate EnergAIzer's exploration capabilities for frequency scaling and architectural configurations, including forecasting the power of NVIDIA H100 with just 7% error. In summary, EnergAIzer provides fast and accurate power prediction for AI workloads, paving the way for power-aware design explorations.

60.2 LG May 14
EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

Zhiye Song, Kyungmi Lee, Eun Kyung Lee et al.

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.
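
For readers unfamiliar with the error metric quoted above, the following is a minimal sketch of how a mean absolute percentage error (MAPE) is conventionally computed. The function name and any inputs passed to it are hypothetical; the paper's exact evaluation procedure is not described in this abstract.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have the same length and no measured value is zero.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)
```

As a hypothetical illustration, measured energies of 100 J and 200 J with predictions of 110 J and 180 J are each off by 10%, so the MAPE is 10%.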

CR Jan 19, 2022
A Low-Power BLS12-381 Pairing Crypto-Processor for Internet-of-Things Security Applications

Utsav Banerjee, Anantha P. Chandrakasan

We present the first BLS12-381 elliptic curve pairing crypto-processor for Internet-of-Things (IoT) security applications. Efficient finite field arithmetic and algorithm-architecture co-optimizations together enable two orders of magnitude energy savings. We implement several countermeasures against timing and power side-channel attacks. Our crypto-processor is programmable to provide the flexibility to accelerate various elliptic curve and pairing-based protocols such as signature aggregation and functional encryption.

CR Mar 26, 2021
Leaky Nets: Recovering Embedded Neural Network Models and Inputs through Simple Power and Timing Side-Channels -- Attacks and Defenses

Saurav Maji, Utsav Banerjee, Anantha P. Chandrakasan

With the recent advancements in machine learning theory, many commercial embedded micro-processors use neural network models for a variety of signal processing applications. However, their associated side-channel security vulnerabilities pose a major concern. There have been several proof-of-concept attacks demonstrating the extraction of their model parameters and input data, but many of these attacks involve specific assumptions, have limited applicability, or pose huge overheads to the attacker. In this work, we study the side-channel vulnerabilities of embedded neural network implementations by recovering their parameters using timing-based information leakage and simple power analysis side-channel attacks. We demonstrate our attacks on popular micro-controller platforms over networks of different precisions such as floating point, fixed point, and binary networks. We are able to successfully recover not only the model parameters but also the inputs for the above networks. Countermeasures against timing-based attacks are implemented and their overheads are analyzed.

LG Jun 1, 2020
Rethinking Empirical Evaluation of Adversarial Robustness Using First-Order Attack Methods

Kyungmi Lee, Anantha P. Chandrakasan

We identify three common cases that lead to overestimation of adversarial accuracy against bounded first-order attack methods, which is commonly used as a proxy for adversarial robustness in empirical studies. For each case, we propose compensation methods that either address sources of inaccurate gradient computation, such as numerical instability near zero and non-differentiability, or reduce the total number of back-propagations for iterative attacks by approximating second-order information. These compensation methods can be combined with existing attack methods for a more precise empirical evaluation metric. We illustrate the impact of these three cases with examples of practical interest, such as benchmarking model capacity and regularization techniques for robustness. Overall, our work shows that overestimated adversarial accuracy that is not indicative of robustness is prevalent even for conventionally trained deep neural networks, and highlights the need for caution when using empirical evaluation without guaranteed bounds.

CR Oct 16, 2019
Sapphire: A Configurable Crypto-Processor for Post-Quantum Lattice-based Protocols

Utsav Banerjee, Tenzin S. Ukyab, Anantha P. Chandrakasan

Public key cryptography protocols, such as RSA and elliptic curve cryptography, will be rendered insecure by Shor's algorithm when large-scale quantum computers are built. Cryptographers are working on quantum-resistant algorithms, and lattice-based cryptography has emerged as a prime candidate. However, the high computational complexity of these algorithms makes it challenging to implement lattice-based protocols on low-power embedded devices. To address this challenge, we present Sapphire, a lattice cryptography processor with configurable parameters. Efficient sampling, with a SHA-3-based PRNG, provides two orders of magnitude energy savings; a single-port RAM-based number theoretic transform memory architecture is proposed, which provides 124k-gate area savings; while a low-power modular arithmetic unit accelerates polynomial computations. Our test chip was fabricated in TSMC 40nm low-power CMOS process, with the Sapphire cryptographic core occupying 0.28 mm$^2$ area consisting of 106k logic gates and 40.25 KB SRAM. Sapphire can be programmed with custom instructions for polynomial arithmetic and sampling, and it is coupled with a low-power RISC-V micro-processor to demonstrate NIST Round 2 lattice-based CCA-secure key encapsulation and signature protocols Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium, achieving up to an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art hardware implementations. All key building blocks of Sapphire are constant-time and secure against timing and simple power analysis side-channel attacks. We also discuss how masking-based DPA countermeasures can be implemented on the Sapphire core without any changes to the hardware.
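
As background on the number theoretic transform (NTT) mentioned in this abstract, the sketch below shows how an NTT turns polynomial multiplication into pointwise multiplication. It is a toy illustration only: the parameters (modulus 17, length 8, root 2) are chosen for readability, the transform is the naive quadratic-time form, and the product is cyclic (modulo x^8 - 1). None of this reflects Sapphire's actual parameters or memory architecture.

```python
# Toy NTT over Z_17 with length 8. 2 is a primitive 8th root of unity mod 17
# (2**4 % 17 == 16 and 2**8 % 17 == 1). Parameters are illustrative assumptions.
Q = 17
N = 8
OMEGA = 2


def ntt(coeffs, root=OMEGA):
    """Naive O(N^2) transform: evaluate the polynomial at powers of `root`."""
    return [
        sum(c * pow(root, i * k, Q) for i, c in enumerate(coeffs)) % Q
        for k in range(N)
    ]


def intt(values):
    """Inverse transform: use the inverse root and divide by N (mod Q)."""
    inv_n = pow(N, -1, Q)          # modular inverse, Python 3.8+
    inv_root = pow(OMEGA, -1, Q)
    return [(inv_n * v) % Q for v in ntt(values, inv_root)]


def cyclic_multiply(a, b):
    """Multiply two length-N polynomials modulo (x^N - 1) and modulo Q."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(fa, fb)])
```

For instance, multiplying 1 + 2x by 3 + 4x this way gives the coefficients of 3 + 10x + 8x^2, the same result as schoolbook multiplication.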

CR Jul 9, 2019
An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for Securing Internet-of-Things Applications

Utsav Banerjee, Andrew Wright, Chiraag Juvekar et al.

This paper presents the first hardware implementation of the Datagram Transport Layer Security (DTLS) protocol to enable end-to-end security for the Internet of Things (IoT). A key component of this design is a reconfigurable prime field elliptic curve cryptography (ECC) accelerator, which is 238x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Our full hardware implementation of the DTLS 1.3 protocol provides 438x improvement in energy-efficiency over software, along with code size and data memory usage as low as 8 KB and 3 KB respectively. The cryptographic accelerators are coupled with an on-chip low-power RISC-V processor to benchmark applications beyond DTLS with up to two orders of magnitude energy savings. The test chip, fabricated in 65 nm CMOS, demonstrates hardware-accelerated DTLS sessions while consuming 44.08 uJ per handshake, and 0.89 nJ per byte of encrypted data at 16 MHz and 0.8 V.
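
The two energy figures at the end of this abstract can be combined into a rough per-session estimate. The sketch below assumes a simple linear model (a fixed handshake cost plus a per-byte cost) at the reported 16 MHz, 0.8 V operating point. The model and the function name are illustrative assumptions, not the paper's methodology.

```python
HANDSHAKE_UJ = 44.08  # reported energy per DTLS handshake, in microjoules
PER_BYTE_NJ = 0.89    # reported energy per byte of encrypted data, in nanojoules


def session_energy_uj(payload_bytes, handshakes=1):
    """Estimated session energy in microjoules under an assumed linear model."""
    return handshakes * HANDSHAKE_UJ + payload_bytes * PER_BYTE_NJ / 1000.0
```

Under this model, a session with one handshake and 1024 bytes of encrypted data costs about 44.99 uJ, so the handshake dominates for small payloads. The data cost only matches the handshake cost at roughly 49.5 thousand bytes.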

CR Mar 11, 2019
An Energy-Efficient Configurable Lattice Cryptography Processor for the Quantum-Secure Internet of Things

Utsav Banerjee, Abhishek Pathak, Anantha P. Chandrakasan

This paper presents a configurable lattice cryptography processor which enables quantum-resistant security protocols for IoT. Efficient sampling architectures, coupled with a low-power SHA-3 core, provide two orders of magnitude energy savings over software. A single-port RAM-based NTT architecture is proposed, which provides ~124k-gate area savings. This is the first ASIC implementation which demonstrates multiple lattice-based protocols proposed for NIST post-quantum standardization.

CR Mar 11, 2019
An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for End-to-End Security in IoT Applications

Utsav Banerjee, Chiraag Juvekar, Andrew Wright et al.

This paper presents a reconfigurable cryptographic engine that implements the DTLS protocol to enable end-to-end security for IoT. This implementation of the DTLS engine demonstrates 10x reduction in code size and 438x improvement in energy-efficiency over software. Our ECC primitive is 237x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Pairing the DTLS engine with an on-chip RISC-V allows us to demonstrate applications beyond DTLS with up to 2 orders of magnitude energy savings.