Vaughn Betz

h-index: 39

3 papers

8,300 citations

3 Papers

11.8 AR Jul 15 Code

Jack of All Scales: A Versatile FPGA Tensor Block for MXFP Precisions

Marwan Mekhemer, Ahmed Elsousy, Balaji Venkatesh et al.

Modern deep learning workloads increasingly rely on narrow numerical formats to improve efficiency and reduce memory footprint. The recently standardized microscaling floating-point (MXFP) family of formats, including MXFP8, MXFP6, and MXFP4, offers a practical approach to low-precision inference, yet the digital signal processing (DSP) blocks in current FPGA architectures offer limited native support for these formats. In this work, we first present a comprehensive characterization of MXFP dot product implementations on Altera Agilex-5 FPGAs, exploring a range of strategies spanning pure soft logic and DSP blocks in fixed-point, floating-point, and tensor modes. Our results show that while the tensor mode delivers the highest arithmetic density for MXFP4 (E2M1) and MXFP6 (E2M3), it cannot implement MXFP6 (E3M2) or any MXFP8 precision, forcing designers to fall back to lower-density alternatives. Motivated by this gap, we propose targeted modifications to the DSP block's internal tensor-mode architecture that enable native support for all MXFP precisions while retaining backward compatibility. We estimate the area cost of these modifications using a simplified version of the Agilex-5 DSP block core implemented using the open-source ASAP7 PDK. We evaluate a variety of modified DSP block designs that present a tradeoff between format coverage, arithmetic density, and area overhead. Our preferred design point increases the DSP tile area by 36%, corresponding to only 1.8% of the total FPGA die area. We evaluate the device-level impact of our enhanced DSP block by comparing systolic array matrix multiplier implementations across all MXFP precisions, contrasting the best-available strategies on the existing architecture against designs leveraging our modified DSP block. Our results demonstrate an average throughput improvement of 4.2x across all supported MXFP formats.
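For readers unfamiliar with MXFP arithmetic, the following is a minimal software sketch of an MXFP4 (E2M1) dot product. It is not taken from the paper and does not model the DSP block; it assumes the usual microscaling layout in which a block of elements shares one power-of-two E8M0 scale (bias 127) and each element is a 4-bit E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1). The function names `decode_e2m1`, `decode_e8m0`, and `mx_dot` are hypothetical.

```python
import math


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: the only nonzero magnitude is 0.5.
        return sign * man * 0.5
    # Normal range: implicit leading 1, exponent bias of 1 (assumed).
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)


def decode_e8m0(scale_code: int) -> float:
    """Decode the shared 8-bit scale as 2^(code - 127); 255 is treated as NaN."""
    if scale_code == 255:
        return math.nan
    return 2.0 ** (scale_code - 127)


def mx_dot(a_codes, a_scale, b_codes, b_scale) -> float:
    """Dot product of two MXFP4 blocks, each with its own shared scale."""
    if len(a_codes) != len(b_codes):
        raise ValueError("blocks must have the same number of elements")
    # Accumulate the unscaled element products first, then apply both
    # shared scales once per block.
    acc = sum(decode_e2m1(x) * decode_e2m1(y) for x, y in zip(a_codes, b_codes))
    return acc * decode_e8m0(a_scale) * decode_e8m0(b_scale)
```

Because the scales are applied once per block rather than once per element, the inner loop reduces to narrow multiplies and adds, which is the part a dense hardware dot-product unit would target.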

5.1 AR Aug 17, 2024

H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Mario Doumet, Marius Stan, Mathew Hall et al.

Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field Programmable Gate Arrays (FPGAs) can achieve low latency and high throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM's latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work we obtain speed-ups of at least 19.4x, 5.1x and 10.5x on ResNet-18, ResNet-50 and VGG-16 respectively.
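The abstract does not describe the buffer-selection algorithm itself, so the following is only an illustrative sketch of the kind of decision involved, not H2PIPE's method. It assumes a simple greedy heuristic (offload the buffers that free the most on-chip storage per unit of HBM bandwidth) and sizes each FIFO with the standard latency-times-rate rule. All names (`WeightBuffer`, `min_fifo_depth`, `choose_offchip`) and all units are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WeightBuffer:
    name: str
    size_bits: int         # on-chip storage the buffer would occupy
    bits_per_cycle: float  # read bandwidth needed to keep its layer busy


def min_fifo_depth(latency_cycles: int, words_per_cycle: float) -> int:
    """Words that must be in flight to hide a given memory latency."""
    return math.ceil(latency_cycles * words_per_cycle)


def choose_offchip(buffers, onchip_budget_bits, hbm_bits_per_cycle):
    """Greedy sketch: offload buffers until the rest fits on chip."""
    offchip = []
    used = sum(b.size_bits for b in buffers)
    bandwidth = 0.0
    # Prefer buffers that free the most storage per bit/cycle of HBM bandwidth.
    ranked = sorted(buffers, key=lambda b: b.size_bits / b.bits_per_cycle,
                    reverse=True)
    for b in ranked:
        if used <= onchip_budget_bits:
            break
        if bandwidth + b.bits_per_cycle > hbm_bits_per_cycle:
            continue  # offloading this buffer would exceed the HBM budget
        offchip.append(b.name)
        used -= b.size_bits
        bandwidth += b.bits_per_cycle
    if used > onchip_budget_bits:
        raise ValueError("no feasible assignment under this heuristic")
    return offchip
```

A real flow would replace the fixed latency and bandwidth figures with profiled values for the expected address patterns, as the abstract indicates.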

11.5 CR Dec 14, 2020

Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs

Andrew Boutros, Mathew Hall, Nicolas Papernot et al.

Field-programmable gate arrays (FPGAs) are becoming widely used accelerators for a myriad of datacenter applications due to their flexibility and energy efficiency. Among these applications, FPGAs have shown promising results in accelerating low-latency real-time deep learning (DL) inference, which is becoming an indispensable component of many end-user applications. With the emerging research direction towards virtualized cloud FPGAs that can be shared by multiple users, the security aspect of FPGA-based DL accelerators requires careful consideration. In this work, we evaluate the security of DL accelerators against voltage-based integrity attacks in a multitenant FPGA scenario. We first demonstrate the feasibility of such attacks on a state-of-the-art Stratix 10 card using different attacker circuits that are logically and physically isolated in a separate attacker role, and cannot be flagged as malicious circuits by conventional bitstream checkers. We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs. Then, we carry out the attack on a DL accelerator running ImageNet classification in the victim role to evaluate the inherent resilience of DL models against timing faults induced by the adversary. We find that even when using the strongest attacker circuit, the prediction accuracy of the DL accelerator is not compromised when running at its safe operating frequency. Furthermore, we can achieve 1.18-1.31x higher inference performance by over-clocking the DL accelerator without affecting its prediction accuracy.