Babak Falsafi

LG
4papers
421citations
Novelty60%
AI Score29

4 Papers

ARMar 22, 2022
Scale-out Systolic Arrays

Ahmet Caner Yüzügüler, Canberk Sönmez, Mario Drumond et al.

Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted when accounting for array utilization) poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We, then evaluate the bandwidth/latency trade-offs in interconnects and show that Butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose Scale-out Systolic Arrays, a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. We show that SOSA exhibits scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN inference workloads, and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5x.

LGNov 19, 2022
Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training

Simla Burcu Harma, Ayan Chakraborty, Nicholas Sperry et al.

The unprecedented demand for computing resources to train DNN models has led to a search for minimal numerical encoding. Recent state-of-the-art (SOTA) proposals advocate for multi-level scaled narrow bitwidth numerical formats. In this paper, we show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density. We identify a previously proposed single-level scaled format for 8-bit training, Hybrid Block Floating Point (HBFP), as the optimal candidate to minimize. We perform a full-scale exploration of the HBFP design space using mathematical tools to study the interplay among various parameters and identify opportunities for even smaller encodings across layers and epochs. Based on our findings, we propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training and 6-bit mantissas only in the last epoch and first/last layers. We show Accuracy Booster enables increasing arithmetic density over all other SOTA formats by at least 2.3x while achieving state-of-the-art accuracies in 4-bit training.

CRMar 5, 2019
SMoTherSpectre: exploiting speculative execution through port contention

Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner et al.

Spectre, Meltdown, and related attacks have demonstrated that kernels, hypervisors, trusted execution environments, and browsers are prone to information disclosure through micro-architectural weaknesses. However, it remains unclear as to what extent other applications, in particular those that do not load attacker-provided code, may be impacted. It also remains unclear as to what extent these attacks are reliant on cache-based side channels. We introduce SMoTherSpectre, a speculative code-reuse attack that leverages port-contention in simultaneously multi-threaded processors (SMoTher) as a side channel to leak information from a victim process. SMoTher is a fine-grained side channel that detects contention based on a single victim instruction. To discover real-world gadgets, we describe a methodology and build a tool that locates SMoTher-gadgets in popular libraries. In an evaluation on glibc, we found hundreds of gadgets that can be used to leak information. Finally, we demonstrate proof-of-concept attacks against the OpenSSH server, creating oracles for determining four host key bits, and against an application performing encryption using the OpenSSL library, creating an oracle which can differentiate a bit of the plaintext through gadgets in libcrypto and glibc.

LGApr 4, 2018
Training DNNs with Hybrid Block Floating Point

Mario Drumond, Tao Lin, Martin Jaggi et al.

The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point's narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point. For a wide variety of models, we show that HBFP matches floating point's accuracy while enabling hardware implementations that deliver up to 8.5x higher throughput.