DCSep 14, 2022
MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU PlatformsYuke Wang, Boyuan Feng, Zheng Wang et al. · deepmind
The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery. To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL, MGG-UVM, and ROC, respectively.
ARNov 8, 2023
Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUsHongwu Peng, Caiwen Ding, Tong Geng et al.
The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.
LGDec 5, 2023
The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated LearningOmer Subasi, Oceane Bel, Joseph Manzano et al.
With the advance of the powerful heterogeneous, parallel and distributed computing systems and ever increasing immense amount of data, machine learning has become an indispensable part of cutting-edge technology, scientific research and consumer products. In this study, we present a review of modern machine and deep learning. We provide a high-level overview for the latest advanced machine learning algorithms, applications, and frameworks. Our discussion encompasses parallel distributed learning, deep learning as well as federated learning. As a result, our work serves as an introductory text to the vast field of modern machine learning.
CRSep 17, 2021
Denial-of-Service Attack Detection via Differential Analysis of Generalized Entropy ProgressionsOmer Subasi, Joseph Manzano, Kevin Barker
Denial-of-Service (DoS) attacks are one of the most common and consequential cyber attacks in computer networks. While existing research offers a plethora of detection methods, the issue of achieving both scalability and high detection accuracy remains open. In this work, we address this problem by developing a differential method based on generalized entropy progression. In this method, we continuously fit the line of best fit to the entropy progression and check if the derivative, that is, the slope of this line is less than the negative of the dynamically computed standard deviation of the derivatives. As a result, we omit the usage of the thresholds and the results with five real-world network traffic datasets confirm that our method outperforms threshold-based DoS attack detection by two orders of magnitude on average. Our method achieves false positive rates that are up to 7% where the arithmetic mean is 3% with Tsallis entropy and only 5% sampling of the total network flow. Moreover, since the main computation cost of our method is the entropy computation, which is linear in the volume of the unit-time network flow and it uses integer only operations and a small fraction of the total flow, it is therefore lightweight and scalable.
CRNov 19, 2020
Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU SystemsSankha Baran Dutta, Hoda Naghibijouybari, Nael Abu-Ghazaleh et al.
Graphics Processing Units (GPUs) are a ubiquitous component across the range of today's computing platforms, from phones and tablets, through personal computers, to high-end server class platforms. With the increasing importance of graphics and video workloads, recent processors are shipped with GPU devices that are integrated on the same chip. Integrated GPUs share some resources with the CPU and as a result, there is a potential for microarchitectural attacks from the GPU to the CPU or vice versa. We believe this type of attack, crossing the component boundary (GPU to CPU or vice versa) is novel, introducing unique challenges, but also providing the attacker with new capabilities that must be considered when we design defenses against microarchitectrual attacks in these environments. Specifically, we consider the potential for covert channel attacks that arise either from shared microarchitectural components (such as caches) or through shared contention domains (e.g., shared buses). We illustrate these two types of channels by developing two reliable covert channel attacks. The first covert channel uses the shared LLC cache in Intel's integrated GPU architectures. The second is a contention based channel targeting the ring bus connecting the CPU and GPU to the LLC. Cross component channels introduce a number of new challenges that we had to overcome since they occur across heterogeneous components that use different computation models and are interconnected using asymmetric memory hierarchies. We also exploit GPU parallelism to increase the bandwidth of the communication, even without relying on a common clock. The LLC based channel achieves a bandwidth of 120 kbps with a low error rate of 2%, while the contention based channel delivers up to 400 kbps with a 0.8% error rate.