Vaughn Betz

AR
3 papers
62 citations
Novelty 45%
AI Score 43

3 Papers

49.1 · AR · Jun 4 · Code
Modeling, Optimizing and Exploring Multi-Die FPGA Routing Architectures

Amirhossein Poolad, Soheil Gholami Shahrouz, Andrew Boutros et al.

Die stacking has enabled 2.5D FPGAs by integrating multiple active dice on a passive silicon interposer for improved yield and capacity, and paved the way for 3D architectures that stack active dice directly atop one another. In these multi-die devices, the unique electrical and physical characteristics of the underlying die-stacking technology impose limitations on inter-die connection density and latency, necessitating a bespoke inter-die routing architecture. However, the absence of accurate and versatile modeling tools has left most questions about how to best design the inter-die routing architecture unanswered. To address this gap, we enhance the open-source FPGA CAD tool VTR to flexibly model a wide range of multi-die routing architectures, and augment VPR's placement and routing engines to improve optimization for both 2.5D and 3D FPGAs. We perform HSPICE-based circuit modeling of inter-die connections for active dice using a 7 nm process node and a 45 nm silicon interposer across several die-crossing technologies. Using this enhanced framework, we conduct a detailed design space exploration of inter-die routing architecture in 2.5D and 3D FPGAs, characterizing the impact of die-crossing technology, inter-die connection count, fan-in/fan-out, and interposer wire length on critical path delay (CPD), wirelength, area, and routability. Our results show that with suitable inter-die routing architectures, 2.5D and 3D FPGAs can increase capacity without significant routability or delay penalties. Specifically, 3D FPGAs achieve up to 14% wirelength reduction and 6% CPD improvement over 2D devices, and remain routable even with existing $10\,\mu$m pitch technologies, while 2.5D FPGAs incur only a 2% wirelength and 4% CPD overhead at 32% inter-die connectivity. All extensions are open source and integrated with the VTR master branch.

AR · Aug 17, 2024
H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Mario Doumet, Marius Stan, Mathew Hall et al.

Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field Programmable Gate Arrays (FPGAs) can achieve low-latency, high-throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally, this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work, we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high-performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM's latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work, we obtain speed-ups of at least 19.4x, 5.1x and 10.5x on ResNet-18, ResNet-50 and VGG-16 respectively.
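
The abstract does not spell out the selection algorithm, so the following is only an illustrative sketch of the kind of decision it describes, not the H2PIPE method. It assumes a simple greedy heuristic (offload the buffers that free the most on-chip bits per unit of HBM bandwidth they consume) and sizes each FIFO from the bandwidth-delay product. The names `WeightBuffer`, `choose_offchip` and `fifo_depth_words`, and all numbers in the usage lines, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WeightBuffer:
    name: str
    size_bits: int         # on-chip storage this buffer would occupy
    bandwidth_gbps: float  # sustained read bandwidth its layer needs (Gbit/s)


def choose_offchip(buffers, onchip_budget_bits, hbm_bandwidth_gbps):
    """Greedy sketch (an assumption, not the paper's algorithm).

    Moves buffers to HBM, largest bits-freed-per-Gbit/s first, until the
    remaining buffers fit in the on-chip budget or HBM bandwidth runs out.
    Returns the names of the offloaded buffers in the order chosen.
    """
    onchip_used = sum(b.size_bits for b in buffers)
    hbm_used = 0.0
    offchip = []
    ranked = sorted(buffers, key=lambda b: b.size_bits / b.bandwidth_gbps,
                    reverse=True)
    for b in ranked:
        if onchip_used <= onchip_budget_bits:
            break  # everything left already fits on chip
        if hbm_used + b.bandwidth_gbps > hbm_bandwidth_gbps:
            continue  # offloading this buffer would oversubscribe HBM
        offchip.append(b.name)
        onchip_used -= b.size_bits
        hbm_used += b.bandwidth_gbps
    if onchip_used > onchip_budget_bits:
        raise ValueError("no feasible split under the given budgets")
    return offchip


def fifo_depth_words(latency_ns, bandwidth_gbps, word_bits):
    """Bandwidth-delay product: words in flight during one HBM round trip.

    ns * Gbit/s = bits, so no unit conversion factor is needed.
    """
    bits_in_flight = latency_ns * bandwidth_gbps
    return math.ceil(bits_in_flight / word_bits)


# Hypothetical example values, chosen only to exercise the functions.
layers = [
    WeightBuffer("conv_a", size_bits=100, bandwidth_gbps=1.0),
    WeightBuffer("conv_b", size_bits=50, bandwidth_gbps=5.0),
    WeightBuffer("conv_c", size_bits=30, bandwidth_gbps=1.0),
]
offloaded = choose_offchip(layers, onchip_budget_bits=60, hbm_bandwidth_gbps=2.0)
depth = fifo_depth_words(latency_ns=500, bandwidth_gbps=2.0, word_bits=256)
```

A FIFO at least as deep as the bandwidth-delay product can keep supplying a compute unit for one full memory round trip, which is the intuition behind sizing FIFOs to limit stalling. A real design would also need margin for latency variation, which the profiling mentioned in the abstract would inform.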

CR · Dec 14, 2020
Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs

Andrew Boutros, Mathew Hall, Nicolas Papernot et al.

Field-programmable gate arrays (FPGAs) are becoming widely used accelerators for a myriad of datacenter applications due to their flexibility and energy efficiency. Among these applications, FPGAs have shown promising results in accelerating low-latency real-time deep learning (DL) inference, which is becoming an indispensable component of many end-user applications. With the emerging research direction towards virtualized cloud FPGAs that can be shared by multiple users, the security aspect of FPGA-based DL accelerators requires careful consideration. In this work, we evaluate the security of DL accelerators against voltage-based integrity attacks in a multi-tenant FPGA scenario. We first demonstrate the feasibility of such attacks on a state-of-the-art Stratix 10 card using different attacker circuits that are logically and physically isolated in a separate attacker role, and cannot be flagged as malicious circuits by conventional bitstream checkers. We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs. Then, we carry out the attack on a DL accelerator running ImageNet classification in the victim role to evaluate the inherent resilience of DL models against timing faults induced by the adversary. We find that even when using the strongest attacker circuit, the prediction accuracy of the DL accelerator is not compromised when running at its safe operating frequency. Furthermore, we can achieve 1.18-1.31x higher inference performance by overclocking the DL accelerator without affecting its prediction accuracy.