AR Jun 27, 2023
A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms
Cristina Silvano, Daniele Ielmini, Fabrizio Ferrandi et al.
Recent trends in deep learning (DL) have made hardware accelerators essential for various high-performance computing (HPC) applications, including image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent developments in DL accelerators, focusing on their role in meeting the performance demands of HPC applications. We explore cutting-edge approaches to DL acceleration, covering not only GPU- and TPU-based platforms but also specialized hardware such as FPGA- and ASIC-based accelerators, Neural Processing Units, open hardware RISC-V-based accelerators, and co-processors. This survey also describes accelerators leveraging emerging memory technologies and computing paradigms, including 3D-stacked Processor-In-Memory, non-volatile memories like Resistive RAM and Phase Change Memories used for in-memory computing, as well as Neuromorphic Processing Units, and Multi-Chip Module-based accelerators. Furthermore, we provide insights into emerging quantum-based accelerators and photonics. Finally, this survey categorizes the most influential architectures and technologies from recent years, offering readers a comprehensive perspective on the rapidly evolving field of deep learning acceleration.
AR Nov 29, 2023
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
Serena Curzel, Fabrizio Ferrandi, Leandro Fiorin et al.
Given their increasing size and complexity, the need for efficient execution of deep neural networks has become increasingly pressing in the design of heterogeneous High-Performance Computing (HPC) and edge platforms, leading to a wide variety of proposals for specialized deep learning architectures and hardware accelerators. The design of such architectures and accelerators requires a multidisciplinary approach combining expertise from several areas, from machine learning to computer architecture, low-level hardware design, and approximate computing. Several methodologies and tools have been proposed to improve the process of designing accelerators for deep learning, aimed at maximizing parallelism and minimizing data movement to achieve high performance and energy efficiency. This paper critically reviews influential tools and design methodologies for Deep Learning accelerators, offering a wide perspective in this rapidly evolving field. This work complements surveys on architectures and accelerators by covering hardware-software co-design, automated synthesis, domain-specific compilers, design space exploration, modeling, and simulation, providing insights into technical challenges and open research directions.
35.7 DC Apr 10
MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs
Enrico Russo, Mohamed Amine Hamdi, Alessandro Ottaviano et al.
Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.
AR Apr 13, 2024
Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems
Francesco G. Blanco, Enrico Russo, Maurizio Palesi et al.
Currently, there is a growing trend of outsourcing the execution of DNNs to cloud services. For service providers, managing multi-tenancy and ensuring high-quality service delivery, particularly in meeting stringent execution time constraints, assumes paramount importance, all while endeavoring to maintain cost-effectiveness. In this context, the utilization of heterogeneous multi-accelerator systems becomes increasingly relevant. This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the online scheduling of DNNs in multi-tenant environments, taking into account the dataflow heterogeneity of accelerators and memory bandwidth contention. By doing so, service providers can employ the most efficient scheduling policy for user requests, optimizing Service-Level-Agreement (SLA) satisfaction rates and enhancing hardware utilization. The application of RELMAS to a heterogeneous multi-accelerator system composed of various instances of Simba and Eyeriss sub-accelerators resulted in up to a 173% improvement in SLA satisfaction rate compared to state-of-the-art scheduling techniques across different workload scenarios, with less than a 1.5% energy overhead.
AR Feb 9, 2024
Towards Fair and Firm Real-Time Scheduling in DNN Multi-Tenant Multi-Accelerator Systems via Reinforcement Learning
Enrico Russo, Francesco Giulio Blanco, Maurizio Palesi et al.
This paper addresses the critical challenge of managing Quality of Service (QoS) in cloud services, focusing on the nuances of individual tenant expectations and varying Service Level Indicators (SLIs). It introduces a novel approach utilizing Deep Reinforcement Learning for tenant-specific QoS management in multi-tenant, multi-accelerator cloud environments. The chosen SLI, deadline hit rate, allows clients to tailor QoS for each service request. A novel online scheduling algorithm for Deep Neural Networks in multi-accelerator systems is proposed, with a focus on guaranteeing tenant-wise, model-specific QoS levels while considering real-time constraints.
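As a rough illustration of the deadline hit rate SLI mentioned above, the following sketch computes a per-tenant hit rate from a log of completed requests. The `Request` record and its field names are hypothetical and are not taken from the paper; the sketch only assumes that a request "hits" when it finishes no later than its deadline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one completed inference request."""
    tenant: str
    deadline: float  # absolute deadline, in seconds
    finish: float    # absolute completion time, in seconds


def deadline_hit_rate(requests: Iterable[Request]) -> Dict[str, float]:
    """Return, per tenant, the fraction of requests finished by their deadline."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for r in requests:
        totals[r.tenant] += 1
        if r.finish <= r.deadline:
            hits[r.tenant] += 1
    return {tenant: hits[tenant] / totals[tenant] for tenant in totals}
```

A scheduler targeting tenant-wise QoS could compare each value against that tenant's requested level; how the paper's algorithm uses the indicator internally is not specified here.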
LG Nov 25, 2024
A Data-Driven Approach to Dataflow-Aware Online Scheduling for Graph Neural Network Inference
Pol Puigdemont, Enrico Russo, Axel Wassington et al.
Graph Neural Networks (GNNs) have shown significant promise in various domains, such as recommendation systems, bioinformatics, and network analysis. However, the irregularity of graph data poses unique challenges for efficient computation, leading to the development of specialized GNN accelerator architectures that surpass traditional CPU and GPU performance. Despite this, the structural diversity of input graphs results in varying performance across different GNN accelerators, depending on their dataflows. This variability in performance due to differing dataflows and graph properties remains largely unexplored, limiting the adaptability of GNN accelerators. To address this, we propose a data-driven framework for dataflow-aware latency prediction in GNN inference. Our approach involves training regressors to predict the latency of executing specific graphs on particular dataflows, using simulations on synthetic graphs. Experimental results indicate that our regressors can predict the optimal dataflow for a given graph with up to 91.28% accuracy and a Mean Absolute Percentage Error (MAPE) of 3.78%. Additionally, we introduce an online scheduling algorithm that uses these regressors to enhance scheduling decisions. Our experiments demonstrate that this algorithm achieves up to $3.17\times$ speedup in mean completion time and $6.26\times$ speedup in mean execution time compared to the best feasible baseline across all datasets.
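The two ideas in this abstract, scoring a latency regressor with MAPE and choosing the dataflow with the lowest predicted latency, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regressors are modeled as plain callables over a hypothetical feature dictionary, and all names are invented for the example.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Regressor = Callable[[Features], float]


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, returned in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def pick_dataflow(features: Features, regressors: Dict[str, Regressor]) -> str:
    """Return the name of the dataflow whose regressor predicts the lowest latency."""
    if not regressors:
        raise ValueError("at least one regressor is required")
    return min(regressors, key=lambda name: regressors[name](features))


if __name__ == "__main__":
    # Hypothetical toy regressors keyed by dataflow name.
    toy = {
        "dataflow_a": lambda f: 2.0 * f["num_nodes"],
        "dataflow_b": lambda f: f["num_nodes"] + 5.0,
    }
    print(pick_dataflow({"num_nodes": 10.0}, toy))
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

An online scheduler could call `pick_dataflow` for each incoming graph and dispatch it to an accelerator implementing the selected dataflow, subject to availability.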
QUANT-PH Jun 17, 2024
Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures
Enrico Russo, Maurizio Palesi, Davide Patti et al.
Modular, distributed and multi-core architectures are currently considered a promising approach for scalability of quantum computing systems. The integration of multiple Quantum Processing Units necessitates classical and quantum-coherent communication, introducing challenges related to noise and quantum decoherence in quantum state transfers between cores. Optimizing communication becomes imperative, and the compilation and mapping of quantum circuits onto physical qubits must minimize state transfers while adhering to architectural constraints. The compilation process, inherently an NP-hard problem, demands extensive search times to be solved to optimality, even with a small number of qubits. To address this challenge efficiently, we advocate for the utilization of heuristic mappers that can rapidly generate solutions. In this work, we propose a novel approach employing Deep Reinforcement Learning (DRL) methods to learn these heuristics for a specific multi-core architecture. Our DRL agent incorporates a Transformer encoder and Graph Neural Networks. It encodes quantum circuits using self-attention mechanisms and produces outputs through an attention-based pointer mechanism that directly signifies the probability of matching logical qubits with physical cores. This enables the selection of optimal cores for logical qubits efficiently. Experimental evaluations show that the proposed method can outperform baseline approaches in terms of reducing inter-core communications and minimizing online time-to-solution. This research contributes to the advancement of scalable quantum computing systems by introducing a novel learning-based heuristic approach for efficient quantum circuit compilation and mapping.
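To make the attention-based pointer idea concrete, the sketch below turns a query vector for one logical qubit and one key vector per physical core into a probability distribution over cores, masking cores with no free capacity. It is a generic scaled dot-product pointer written for illustration; the embeddings, the masking rule, and all names are assumptions and do not reproduce the architecture of the paper.

```python
import math
from typing import List, Sequence


def pointer_probs(
    query: Sequence[float],
    core_keys: Sequence[Sequence[float]],
    core_free: Sequence[bool],
) -> List[float]:
    """Softmax over scaled dot-product scores, with full cores masked out."""
    scale = math.sqrt(len(query))
    scores = []
    for key, free in zip(core_keys, core_free):
        if free:
            scores.append(sum(q * k for q, k in zip(query, key)) / scale)
        else:
            scores.append(-math.inf)  # masked: this core cannot host the qubit
    top = max(scores)
    if top == -math.inf:
        raise ValueError("no core has free capacity")
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]
```

A greedy mapper could assign the qubit to the core with the highest probability, while a sampling-based policy would draw a core from the distribution during training.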