Andrew Boutros

4 papers

75 citations

Novelty: 44%

AI Score: 45

Ranked #65,894 of 201,326 authors (top 33%); #215 in AR (top 28%)

4 Papers

55.8 · AR · Jun 4 · Code

Modeling, Optimizing and Exploring Multi-Die FPGA Routing Architectures

Amirhossein Poolad, Soheil Gholami Shahrouz, Andrew Boutros et al.

Die stacking has enabled 2.5D FPGAs by integrating multiple active dice on a passive silicon interposer for improved yield and capacity, and paved the way for 3D architectures that stack active dice directly atop one another. In these multi-die devices, the unique electrical and physical characteristics of the underlying die-stacking technology impose limitations on inter-die connection density and latency, necessitating a bespoke inter-die routing architecture. However, the absence of accurate and versatile modeling tools has left most questions about how to best design the inter-die routing architecture unanswered. To address this gap, we enhance the open-source FPGA CAD tool VTR to flexibly model a wide range of multi-die routing architectures, and augment VPR's placement and routing engines to improve optimization for both 2.5D and 3D FPGAs. We perform HSPICE-based circuit modeling of inter-die connections for active dice using a 7 nm process node and a 45 nm silicon interposer across several die-crossing technologies. Using this enhanced framework, we conduct a detailed design space exploration of inter-die routing architecture in 2.5D and 3D FPGAs, characterizing the impact of die-crossing technology, inter-die connection count, fan-in/fan-out, and interposer wire length on critical path delay (CPD), wirelength, area, and routability. Our results show that with suitable inter-die routing architectures, 2.5D and 3D FPGAs can increase capacity without significant routability or delay penalties. Specifically, 3D FPGAs achieve up to 14% wirelength reduction and 6% CPD improvement over 2D devices, and remain routable even with existing $10\,\mu$m pitch technologies, while 2.5D FPGAs incur only a 2% wirelength and 4% CPD overhead at 32% inter-die connectivity. All extensions are open source and integrated with the VTR master branch.

DC · Apr 22, 2022

FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

Rui Ma, Evangelos Georganas, Alexander Heinecke et al.

Rapid advances in artificial intelligence (AI) technology have led to significant accuracy improvements in a myriad of application domains at the cost of larger and more compute-intensive models. Training such models on massive amounts of data typically requires scaling to many compute nodes and relies heavily on collective communication algorithms, such as all-reduce, to exchange the weight gradients between different nodes. The overhead of these collective communication operations in a distributed AI training system can bottleneck its performance, with more pronounced effects as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead by profiling distributed AI training. Then, we propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize network bandwidth utilization via data compression. The AI smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases the overall node-to-node communication efficiency. We perform real measurements on a prototype distributed AI training system comprised of 6 compute nodes to evaluate the performance gains of our proposed FPGA-based AI smart NIC compared to a baseline system with regular NICs. We also use these measurements to validate an analytical model that we formulate to predict performance when scaling to larger systems. Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.

34.2 · AR · Apr 27

Déjà Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization

Milo Liebster, Amin Mohaghegh, Andrew Boutros

Implementing a digital circuit on an FPGA fabric requires clustering technology-mapped netlist primitives into coarser-granularity blocks that can be directly mapped to the physical resources available on the FPGA. As the architecture of FPGA logic blocks (LBs) has grown in complexity, with sophisticated logic elements (LEs) and highly irregular local interconnect, this packing problem has become more challenging. To ensure the feasibility of intracluster routing, the computer-aided design (CAD) tools must solve a costly multi-source multi-sink routing problem for each candidate cluster. In this paper, we first show that such packing legality checks consume a significant portion of the CAD flow runtime for LB architectures with complex LEs and local routing structures resembling modern commercial FPGAs. We demonstrate that the packing stage constitutes 58% and 94% of the entire Versatile Place and Route (VPR) flow runtime on average when mapping a wide variety of benchmarks to the AMD 7-series-like and Altera Stratix-10-like VTR architecture captures, respectively. By analyzing the packing algorithm behavior, we observe that a significant fraction of the attempted packed clusters are repetitions of a much smaller number of packing patterns, and therefore many of the packing legality checks are redundant and could be skipped. To this end, we introduce our Déjà Vu packing approach, which leverages a novel packing signature tree data structure that enables efficient identification of recurring packing patterns and memoization of their legality check outcomes. Our approach speeds up the packing by up to 13.4x and 29.3x, with an average of 3.7x and 6.9x, across the evaluated benchmarks on the 7-series and Stratix 10 architectures. These packing runtime gains result in a significant 1.6x and 5.3x average reduction in end-to-end VPR runtime, while maintaining quality of results.
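The memoization idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the paper describes a packing signature tree, whereas this sketch swaps in a plain dictionary keyed by a simplified signature (a sorted count of primitive types). All names here (`signature`, `MemoizedLegalityChecker`, `toy_check`) are hypothetical, and a real signature would also need to capture connectivity, since primitive counts alone do not determine intracluster routability.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical simplification: a candidate cluster is a sequence of
# primitive type names, e.g. ["lut", "ff"].
Cluster = Sequence[str]
Signature = Tuple[Tuple[str, int], ...]


def signature(cluster: Cluster) -> Signature:
    """Order-independent key: sorted (primitive type, count) pairs."""
    counts: Dict[str, int] = {}
    for prim_type in cluster:
        counts[prim_type] = counts.get(prim_type, 0) + 1
    return tuple(sorted(counts.items()))


class MemoizedLegalityChecker:
    """Caches legality-check outcomes per signature (assumed sketch)."""

    def __init__(self, expensive_check: Callable[[Cluster], bool]) -> None:
        self._check = expensive_check
        self._cache: Dict[Signature, bool] = {}
        self.hits = 0
        self.misses = 0

    def is_legal(self, cluster: Cluster) -> bool:
        key = signature(cluster)
        if key in self._cache:
            # Pattern seen before: reuse the stored outcome.
            self.hits += 1
            return self._cache[key]
        # New pattern: run the costly check once and remember the result.
        self.misses += 1
        result = self._check(cluster)
        self._cache[key] = result
        return result


def toy_check(cluster: Cluster) -> bool:
    """Stand-in for the costly routing check: at most two LUTs per cluster."""
    return sum(1 for p in cluster if p == "lut") <= 2
```

Wrapping `toy_check` in `MemoizedLegalityChecker` and querying clusters that differ only in primitive order results in a single call to the underlying check; the `hits` and `misses` counters show how many checks were skipped.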

CR · Dec 14, 2020

Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs

Andrew Boutros, Mathew Hall, Nicolas Papernot et al.

Field-programmable gate arrays (FPGAs) are becoming widely used accelerators for a myriad of datacenter applications due to their flexibility and energy efficiency. Among these applications, FPGAs have shown promising results in accelerating low-latency real-time deep learning (DL) inference, which is becoming an indispensable component of many end-user applications. With the emerging research direction towards virtualized cloud FPGAs that can be shared by multiple users, the security aspect of FPGA-based DL accelerators requires careful consideration. In this work, we evaluate the security of DL accelerators against voltage-based integrity attacks in a multitenant FPGA scenario. We first demonstrate the feasibility of such attacks on a state-of-the-art Stratix 10 card using different attacker circuits that are logically and physically isolated in a separate attacker role, and cannot be flagged as malicious circuits by conventional bitstream checkers. We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs. Then, we carry out the attack on a DL accelerator running ImageNet classification in the victim role to evaluate the inherent resilience of DL models against timing faults induced by the adversary. We find that even when using the strongest attacker circuit, the prediction accuracy of the DL accelerator is not compromised when running at its safe operating frequency. Furthermore, we can achieve 1.18-1.31x higher inference performance by over-clocking the DL accelerator without affecting its prediction accuracy.