Behzad Salami

AR
8papers
161citations
Novelty43%
AI Score48

8 Papers

ARNov 10, 2022
NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators

Aditya Manglik, Minesh Patel, Haiyu Mao et al.

Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.

78.8ARMay 6Code
REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton

Noelia Oliete-Escuín, Arnau Bigas, Narcís Rodas et al.

Chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for open-source frameworks that enable innovation and knowledge sharing. Recently, several open-source proposals have emerged, offering flexible and scalable designs, but fail to meet the performance demands of modern High-Performance Computing (HPC) applications. In this project, we present REPTILES, an open-source RISC-V multicore framework based on OpenPiton\thanks. REPTILES interconnects multiple Sargantana cores with the memory hierarchy of OpenPiton. Moreover, we present the new features incorporated in Sargantana and OpenPiton designs to improve the performance of HPC applications. We demonstrate that REPTILES presents suitable scalability, achieving a speedup of 3.1x on average with 4 cores. Additionally, we show that Sargantana's new features increase the performance of vector addition benchmark in a 9.3x.

54.0ARApr 29Code
EMiX: Emulating Beyond Single-FPGA Limits

Alexander Kropotov, Miquel Moreto, Behzad Salami

FPGA-level emulation is a key step in pre-silicon chip design validation. However, emulating large-scale multi-core systems increasingly exceed the hardware resource capacity of a single FPGA, limiting the feasibility of full-system emulation. To address this challenge, we introduce EMiX, a scalable multi-FPGA framework that enables distributed emulation of multi-core RISC-V architectures beyond single-FPGA resource limits. EMiX systematically partitions a monolithic multi-core design into multiple components and deploys them across multiple interconnected FPGAs, effectively exploiting inter-FPGA interconnects to balance scalability and performance without requiring fundamental RTL redesign. We prototype EMiX with a 64-core architecture across eight interconnected Alveo U55c FPGAs (scalable on core and FPGA counts), successfully demonstrating full-system execution including Linux boot. EMiX will be released as an open-source platform.

69.4ARApr 29
Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

Sajjad Ahmed, Alexander Kropotov, Roberto Ignacio Genovese et al.

The Barcelona Zetascale Lab (BZL) project aims to strengthening Europe's capacity in the design and manufacture of RISC-V based high-performance computing chips. In this context, we present a holistic pre-silicon verification and validation (V&V) methodology targeting highly robust RISC-V chip designs. This paper provides an overview of BZL's V&V approach, which integrates three complementary platforms: (1) a UVM-based verification environment to thoroughly validate RTL functionality; (2) an FPGA-based validation platform that enables system-level pre-silicon hardware-software RTL validation; and (3) a CI/CD flow that continuously automates build, deployment, and tests across these domains. By embedding these platforms into an industrial-grade V&V loop and exploiting large-scale CPU and FPGA hardware infrastructures, the BZL project enables continuous evolution of reliable hardware development and software integration. We believe that the BZL's V&V flow represents a robust and scalable foundation for ensuring the pre-silicon functional correctness and system level validation of RISC-V chip designs, and can serve as a key enabler for strategic initiatives in Europe, such as EPI and DARE, and beyond.

LGMay 4, 2020
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

Behzad Salami, Erhan Baturay Onural, Ismail Emir Yuksel et al.

We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.

LGDec 26, 2019
On the Resilience of Deep Learning for Reduced-voltage FPGAs

Kamyar Givaki, Behzad Salami, Reza Hojabr et al.

Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for power dissipation minimization. Unfortunately, bit-flip faults start to appear as the voltage is scaled down closer to the transistor threshold due to timing issues, thus creating a resilience issue. This paper experimentally evaluates the resilience of the training phase of DNNs in the presence of voltage underscaling related faults of FPGAs, especially in on-chip memories. Toward this goal, we have experimentally evaluated the resilience of LeNet-5 and also a specially designed network for CIFAR-10 dataset with different activation functions of Rectified Linear Unit (Relu) and Hyperbolic Tangent (Tanh). We have found that modern FPGAs are robust enough in extremely low-voltage levels and that low-voltage related faults can be automatically masked within the training iterations, so there is no need for costly software- or hardware-oriented fault mitigation techniques like ECC. Approximately 10% more training iterations are needed to fill the gap in the accuracy. This observation is the result of the relatively low rate of undervolting faults, i.e., <0.1\%, measured on real FPGA fabrics. We have also increased the fault rate significantly for the LeNet-5 network by randomly generated fault injection campaigns and observed that the training accuracy starts to degrade. When the fault rate increases, the network with Tanh activation function outperforms the one with Relu in terms of accuracy, e.g., when the fault rate is 30% the accuracy difference is 4.92%.

ARMar 29, 2019
Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of Undervolting Faults

Behzad Salami, Osman S. Unsal, Adrian Cristal Kestelman

Voltage underscaling below the nominal level is an effective solution for improving energy efficiency in digital circuits, e.g., Field Programmable Gate Arrays (FPGAs). However, further undervolting below a safe voltage level and without accompanying frequency scaling leads to timing related faults, potentially undermining the energy savings. Through experimental voltage underscaling studies on commercial FPGAs, we observed that the rate of these faults exponentially increases for on-chip memories, or Block RAMs (BRAMs). To mitigate these faults, we evaluated the efficiency of the built-in Error-Correction Code (ECC) and observed that more than 90% of the faults are correctable and further 7% are detectable (but not correctable). This efficiency is the result of the single-bit type of these faults, which are then effectively covered by the Single-Error Correction and Double-Error Detection (SECDED) design of the built-in ECC. Finally, motivated by the above experimental observations, we evaluated an FPGA-based Neural Network (NN) accelerator under low-voltage operations, while built-in ECC is leveraged to mitigate undervolting faults and thus, prevent NN significant accuracy loss. In consequence, we achieve 40% of the BRAM power saving through undervolting below the minimum safe voltage level, with a negligible NN accuracy loss, thanks to the substantial fault coverage by the built-in ECC.

LGJun 14, 2018
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation

Behzad Salami, Osman Unsal, Adrian Cristal

Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) model of NN accelerators, in particular, fault characterization and mitigation. By following a High-Level Synthesis (HLS) approach, first, we characterize the vulnerability of various components of RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., data representation model and the parallelism degree of the underlying accelerator. Second, motivated by characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, by 47.3% better than state-of-the-art methods.