18.6 CR Mar 28
Attacking AI Accelerators by Leveraging Arithmetic Properties of Addition
Masoud Heidary, Biresh Kumar Joardar
The dependability of AI models relies largely on the reliability of the underlying computation hardware. Hardware aging attacks can compromise the computing substrate and disrupt AI models over the long run. In this work, we present a new hardware aging attack that exploits the commutative property of addition to disrupt the multiply-and-add operation that forms the backbone of almost all AI models. By permuting the inputs of an adder, the attack preserves functional correctness while inducing unbalanced stress among transistors, accelerating delay degradation in the circuit. Unlike prior approaches, which rely on techniques such as input manipulation or additional Trojan circuitry, the proposed method incurs virtually no area or software overhead. Experimental results across two types of multipliers, different bit widths, and a mix of AI models and datasets demonstrate that the proposed attack degrades inference accuracy by up to 64% in 4 years, posing a significant threat to AI accelerators. The attack can also be extended to the arithmetic units of general-purpose processors.
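The commutativity argument in this abstract can be illustrated with a small sketch. It is not the authors' attack or their aging model: the function names, the 8-bit width, and the operand values are all hypothetical, and the per-bit duty cycle (the fraction of time a port bit is held at 1) is only a crude stand-in for transistor stress. It shows that swapping the two adder ports leaves every sum unchanged while changing which port sees which bit pattern.

```python
def duty_cycles(values, width=8):
    """Fraction of samples in which each bit of an input port is logic 1."""
    counts = [0] * width
    for v in values:
        for bit in range(width):
            counts[bit] += (v >> bit) & 1
    return [c / len(values) for c in counts]


def run_adder(pairs, swap):
    """Feed operand pairs to a two-input adder, optionally swapping the ports."""
    port_a, port_b, sums = [], [], []
    for x, y in pairs:
        a, b = (y, x) if swap else (x, y)
        port_a.append(a)
        port_b.append(b)
        sums.append(a + b)  # commutativity: the sum does not depend on the order
    return port_a, port_b, sums


# Hypothetical workload: one small and one large operand per addition.
pairs = [(1, 200), (2, 220), (3, 240), (0, 255)]

a_plain, b_plain, sums_plain = run_adder(pairs, swap=False)
a_swap, b_swap, sums_swapped = run_adder(pairs, swap=True)

print(sums_plain == sums_swapped)   # functional result is preserved
print(duty_cycles(a_plain))         # bit activity seen by port A, original order
print(duty_cycles(a_swap))          # bit activity seen by port A, swapped order
```

In this toy model the outputs are identical in both runs, but port A sees a very different bit-level activity profile once the operands are swapped, which is the kind of imbalance the abstract describes.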
1.1 AR May 18
Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations
Masoud Heidary, Biresh Kumar Joardar
Hardware aging poses a significant challenge for integrated circuits (ICs), leading to performance degradation and eventual failure. In this work, we focus on the aging of arithmetic multipliers, which are a cornerstone of modern computing systems, including CPUs, GPUs, and FPGAs, as well as AI accelerators such as systolic arrays. In particular, AI workloads, which rely predominantly on multiplications, can accelerate Negative Bias Temperature Instability (NBTI) effects in multipliers. This paper presents a novel aging mitigation technique that leverages the sign-invariance property of multiplication. By selectively applying two's complement transformations to inputs, the method redistributes stress across transistors, reducing the effects of NBTI aging. The proposed method is also integrated into systolic arrays, a common AI accelerator architecture, to demonstrate its efficiency in a high-throughput setting. Experimental evaluations using Cadence tools show a longer lifetime than the natural-aging baseline (no mitigation), while introducing negligible area and delay overheads.
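The sign-invariance property mentioned above, (-a) x (-b) = a x b, can be checked with a short sketch. This is an illustration under assumed conditions, not the paper's mitigation logic: the helper names and the 8-bit width are hypothetical, and the policy for deciding when to negate the inputs is not modeled.

```python
WIDTH = 8  # assumed operand width, chosen only for illustration


def to_raw(value, width=WIDTH):
    """Encode a signed integer as a width-bit two's complement pattern."""
    return value & ((1 << width) - 1)


def to_signed(raw, width=WIDTH):
    """Decode a width-bit two's complement pattern into a signed integer."""
    return raw - (1 << width) if raw & (1 << (width - 1)) else raw


def twos_complement(raw, width=WIDTH):
    """Negate a bit pattern: invert all bits and add one, modulo 2**width."""
    return (~raw + 1) & ((1 << width) - 1)


def multiply(a_raw, b_raw, width=WIDTH):
    """Signed product of two width-bit patterns."""
    return to_signed(a_raw, width) * to_signed(b_raw, width)


a_raw, b_raw = to_raw(3), to_raw(-7)
a_neg, b_neg = twos_complement(a_raw), twos_complement(b_raw)

print(format(a_raw, "08b"), format(b_raw, "08b"), multiply(a_raw, b_raw))
print(format(a_neg, "08b"), format(b_neg, "08b"), multiply(a_neg, b_neg))
```

Both lines report the same product, while the bit patterns applied to the multiplier inputs differ. Note that the most negative value (-128 for 8 bits) has no positive counterpart at the same width, so a real design would need to handle that case separately.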
ET Aug 22, 2025
HePGA: A Heterogeneous Processing-in-Memory based GNN Training Accelerator
Chukwufumnanya Ogbogu, Gaurav Narang, Biresh Kumar Joardar et al.
Processing-In-Memory (PIM) architectures offer a promising approach to accelerate Graph Neural Network (GNN) training and inference. However, various PIM devices such as ReRAM, FeFET, PCM, MRAM, and SRAM exist, each offering unique trade-offs in terms of power, latency, area, and non-idealities. A heterogeneous manycore architecture enabled by 3D integration can combine multiple PIM devices on a single platform to enable energy-efficient and high-performance GNN training. In this work, we propose a 3D heterogeneous PIM-based accelerator for GNN training, referred to as HePGA. We leverage the unique characteristics of GNN layers and associated computing kernels to optimize their mapping onto different PIM devices as well as planar tiers. Our experimental analysis shows that HePGA outperforms existing PIM-based architectures by up to 3.8x and 6.8x in energy efficiency (TOPS/W) and compute efficiency (TOPS/mm²), respectively, without sacrificing GNN prediction accuracy. Finally, we demonstrate the applicability of HePGA to accelerating inference for emerging transformer models.
AR Jan 19, 2024
FARe: Fault-Aware GNN Training on ReRAM-based PIM Accelerators
Pratyush Dhingra, Chukwufumnanya Ogbogu, Biresh Kumar Joardar et al.
Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architecture is an attractive solution for training Graph Neural Networks (GNNs) on edge platforms. However, the immature fabrication process and limited write endurance of ReRAMs make them prone to hardware faults, thereby limiting their widespread adoption for GNN training. Further, existing fault-tolerant solutions prove inadequate for effectively training GNNs in the presence of faults. In this paper, we propose a fault-aware framework, referred to as FARe, that mitigates the effect of faults during GNN training. FARe outperforms existing approaches in terms of both accuracy and timing overhead. Experimental results demonstrate that the FARe framework can restore GNN test accuracy by 47.6% on faulty ReRAM hardware with a ~1% timing overhead compared to the fault-free counterpart.
ET Sep 12, 2021
Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise
Xiaoxuan Yang, Syrine Belakaria, Biresh Kumar Joardar et al.
Resistive random-access memory (ReRAM) is a promising technology for designing hardware accelerators for deep neural network (DNN) inferencing. However, stochastic noise in ReRAM crossbars can degrade DNN inferencing accuracy. We propose the design and optimization of a high-performance, area- and energy-efficient ReRAM-based hardware accelerator to achieve robust DNN inferencing in the presence of stochastic noise. We make two key technical contributions. First, we propose a stochastic-noise-aware training method, referred to as ReSNA, to improve the accuracy of DNN inferencing on ReRAM crossbars with stochastic noise. Second, we propose an information-theoretic algorithm, referred to as CF-MESMO, to identify the Pareto set of solutions that trade off multiple objectives, including inferencing accuracy, area overhead, execution time, and energy consumption. The main challenge in this context is that executing the ReSNA method to evaluate each candidate ReRAM design is prohibitively expensive. To address this challenge, we use continuous-fidelity evaluation of ReRAM designs, varying the number of training epochs to trade off accuracy against computation cost. CF-MESMO iteratively selects the candidate ReRAM design and fidelity pair that maximizes the information gained per unit computation cost about the optimal Pareto front. Our experiments on benchmark DNNs show that the proposed algorithms efficiently uncover high-quality Pareto fronts. On average, ReSNA achieves a 2.57% inferencing accuracy improvement for ResNet20 on the CIFAR-10 dataset with respect to the baseline configuration. Moreover, the CF-MESMO algorithm achieves a 90.91% reduction in computation cost compared to the popular multi-objective optimization algorithm NSGA-II to reach the best solution from NSGA-II.
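To make the notion of a Pareto set concrete, the sketch below filters a handful of candidate designs down to the non-dominated ones. It is a brute-force illustration of the definition only, not CF-MESMO, which the abstract describes as selecting designs and fidelities by information gain. The design names and objective values are invented for the example, and all four objectives (error rate, area, execution time, energy) are assumed to be minimized.

```python
def dominates(p, q):
    """True if p is no worse than q on every objective and strictly better on one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_set(designs):
    """Keep only the designs that no other design dominates."""
    return {
        name: objs
        for name, objs in designs.items()
        if not any(dominates(other, objs) for other in designs.values())
    }


# Hypothetical candidates: (error rate, area, execution time, energy), all minimized.
designs = {
    "A": (0.08, 1.0, 5.0, 2.0),
    "B": (0.10, 0.8, 4.0, 1.5),
    "C": (0.10, 1.2, 6.0, 2.5),  # worse than A on every objective
    "D": (0.06, 1.5, 7.0, 3.0),
}

front = pareto_set(designs)
print(sorted(front))
```

Designs A, B, and D survive because each is best on at least one trade-off, while C is dropped because A matches or beats it everywhere. Exhaustive filtering like this requires evaluating every candidate in full, which is the cost the abstract's continuous-fidelity approach is meant to avoid.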
DC Oct 20, 2018
Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems
Biresh Kumar Joardar, Ryan Gary Kim, Janardhan Rao Doppa et al.
The rising use of deep learning and other big-data algorithms has led to an increasing demand for hardware platforms that are computationally powerful, yet energy-efficient. Due to the amount of data parallelism in these algorithms, high-performance 3D manycore platforms that incorporate both CPUs and GPUs present a promising direction. However, as systems use heterogeneity (e.g., a combination of CPUs, GPUs, and accelerators) to improve performance and efficiency, it becomes more important to address the distinct and likely conflicting communication requirements (e.g., CPU memory access latency or GPU network throughput) that arise from such heterogeneity. Unfortunately, it is difficult to quickly explore the hardware design space and choose appropriate tradeoffs between these heterogeneous requirements. To address these challenges, we propose the design of a 3D Network-on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for a 3D heterogeneous system and explores various tradeoffs using an efficient ML-based multi-objective optimization technique. The proposed design space exploration considers the various requirements of its heterogeneous components and generates a set of 3D NoC architectures that efficiently trade off these design objectives. Our findings show that by jointly considering these requirements (latency, throughput, temperature, and energy), we can achieve 9.6% better Energy-Delay Product on average at nearly iso-temperature conditions when compared to a thermally-optimized design for 3D heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs optimized for a few applications can be generalized to unknown applications as well. Our results show that these generalized 3D NoCs incur only a 1.8% (36-tile system) and 1.1% (64-tile system) average performance loss compared to application-specific NoCs.