Omer Khan

CR
h-index16
8papers
154citations
Novelty59%
AI Score31

8 Papers

ARAug 22, 2023
Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks

Xi Xie, Hongwu Peng, Amit Hasan et al.

Graph Convolutional Networks (GCNs) are pivotal in extracting latent information from graph data across various domains, yet their acceleration on mainstream GPUs is challenged by workload imbalance and memory access irregularity. To address these challenges, we present Accel-GCN, a GPU accelerator architecture for GCNs. The design of Accel-GCN encompasses: (i) a lightweight degree sorting stage to group nodes with similar degree; (ii) a block-level partition strategy that dynamically adjusts warp workload sizes, enhancing shared memory locality and workload balance, and reducing metadata overhead compared to designs like GNNAdvisor; (iii) a combined warp strategy that improves memory coalescing and computational parallelism in the column dimension of dense matrices. Utilizing these principles, we formulated a kernel for sparse matrix multiplication (SpMM) in GCNs that employs block-level partitioning and combined warp strategy. This approach augments performance and multi-level memory efficiency and optimizes memory bandwidth by exploiting memory coalescing and alignment. Evaluation of Accel-GCN across 18 benchmark graphs reveals that it outperforms cuSPARSE, GNNAdvisor, and graph-BLAST by factors of 1.17 times, 1.86 times, and 2.94 times respectively. The results underscore Accel-GCN as an effective solution for enhancing GCN computational efficiency.

LGSep 11, 2022
Towards Sparsification of Graph Neural Networks

Hongwu Peng, Deniz Gurevin, Shaoyi Huang et al.

As real-world graphs expand in size, larger GNN models with billions of parameters are deployed. High parameter count in such models makes training and inference on graphs expensive and challenging. To reduce the computational and memory costs of GNNs, optimization methods such as pruning the redundant nodes and edges in input graphs have been commonly adopted. However, model compression, which directly targets the sparsification of model layers, has been mostly limited to traditional Deep Neural Networks (DNNs) used for tasks such as image classification and object detection. In this paper, we utilize two state-of-the-art model compression methods (1) train and prune and (2) sparse training for the sparsification of weight layers in GNNs. We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs. Our experimental results show that on the ia-email, wiki-talk, and stackoverflow datasets for link prediction, sparse training with much lower training FLOPs achieves a comparable accuracy with the train and prune method. On the brain dataset for node classification, sparse training uses a lower number FLOPs (less than 1/7 FLOPs of train and prune method) and preserves a much better accuracy performance under extreme model sparsity.

LGOct 8, 2022
Towards Real-Time Temporal Graph Learning

Deniz Gurevin, Mohsin Shan, Tong Geng et al.

In recent years, graph representation learning has gained significant popularity, which aims to generate node embeddings that capture features of graphs. One of the methods to achieve this is employing a technique called random walks that captures node sequences in a graph and then learns embeddings for each node using a natural language processing technique called Word2Vec. These embeddings are then used for deep learning on graph data for classification tasks, such as link prediction or node classification. Prior work operates on pre-collected temporal graph data and is not designed to handle updates on a graph in real-time. Real world graphs change dynamically and their entire temporal updates are not available upfront. In this paper, we propose an end-to-end graph learning pipeline that performs temporal graph construction, creates low-dimensional node embeddings, and trains multi-layer neural network models in an online setting. The training of the neural network models is identified as the main performance bottleneck as it performs repeated matrix operations on many sequentially connected low-dimensional kernels. We propose to unlock fine-grain parallelism in these low-dimensional kernels to boost performance of model training.

LGDec 14, 2023
MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training

Hongwu Peng, Xi Xie, Kaustubh Shivdikar et al.

In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based SpGEMM Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and SSpMM Kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that MaxK-GNN system could approach the theoretical speedup limit according to Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: 3.22/4.24 times speedup (vs. theoretical limits, 5.52/7.27 times) on Reddit compared to DGL and GNNAdvisor implementations.

DCNov 25, 2024
OPMOS: Ordered Parallel Algorithm for Multi-Objective Shortest-Paths

Leo Gold, Adam Bienkowski, David Sidoti et al.

The Multi-Objective Shortest-Path (MOS) problem finds a set of Pareto-optimal solutions from a start node to a destination node in a multi-attribute graph. The literature explores multi-objective A*-style algorithmic approaches to solving the NP-hard MOS problem. These approaches use consistent heuristics to compute an exact set of solutions for the goal node. A generalized MOS algorithm maintains a "frontier" of partial paths at each node and performs ordered processing to ensure that Pareto-optimal paths are generated to reach the goal node. The algorithm becomes computationally intractable at a higher number of objectives due to a rapid increase in the search space for non-dominated paths and the significant increase in Pareto-optimal solutions. While prior works have focused on algorithmic methods to reduce the complexity, we tackle this challenge by exploiting parallelism to accelerate the MOS problem. The key insight is that MOS algorithms rely on the ordered execution of partial paths to maintain high work efficiency. The proposed parallel algorithm (OPMOS) unlocks ordered parallelism and efficiently exploits the concurrent execution of multiple paths in MOS. Experimental evaluation using the NVIDIA GH200 Superchip's 72-core Arm-based CPU shows the performance scaling potential of OPMOS on work efficiency and parallelism using a real-world application to ship routing.

CRJan 5, 2022
Secure Remote Attestation with Strong Key Insulation Guarantees

Deniz Gurevin, Chenglu Jin, Phuong Ha Nguyen et al.

Recent years have witnessed a trend of secure processor design in both academia and industry. Secure processors with hardware-enforced isolation can be a solid foundation of cloud computation in the future. However, due to recent side-channel attacks, the commercial secure processors failed to deliver the promises of a secure isolated execution environment. Sensitive information inside the secure execution environment always gets leaked via side channels. This work considers the most powerful software-based side-channel attackers, i.e., an All Digital State Observing (ADSO) adversary who can observe all digital states, including all digital states in secure enclaves. Traditional signature schemes are not secure in ADSO adversarial model. We introduce a new cryptographic primitive called One-Time Signature with Secret Key Exposure (OTS-SKE), which ensures no one can forge a valid signature of a new message or nonce even if all secret session keys are leaked. OTS-SKE enables us to sign attestation reports securely under the ADSO adversary. We also minimize the trusted computing base by introducing a secure co-processor into the system, and the interaction between the secure co-processor and the attestation processor is unidirectional. That is, the co-processor takes no inputs from the processor and only generates secret keys for the processor to fetch. Our experimental results show that the signing of OTS-SKE is faster than that of Elliptic Curve Digital Signature Algorithm (ECDSA) used in Intel SGX.

CRApr 29, 2019
IRONHIDE: A Secure Multicore that Efficiently Mitigates Microarchitecture State Attacks for Interactive Applications

Hamza Omar, Omer Khan

Microprocessors enable aggressive hardware virtualization by means of which multiple processes temporally execute on the system. These security-critical and ordinary processes interact with each other to assure application progress. However, temporal sharing of hardware resources exposes the processor to various microarchitecture state attacks. State-of-the-art secure processors, such as MI6 adopt Intel's SGX enclave execution model. MI6 architects strong isolation by statically isolating shared memory state, and purging the microarchitecture state of private core, cache, and TLB resources on every enclave entry and exit. The purging overhead significantly impacts performance as the interactivity across the secure and insecure processes increases. This paper proposes IRONHIDE that implements strong isolation in the context of multicores to form spatially isolated secure and insecure clusters of cores. For an interactive application comprising of secure and insecure processes, IRONHIDE pins the secure process(es) to the secure cluster, where they execute and interact with the insecure process(es) without incurring the microarchitecture state purging overheads on every interaction event. IRONHIDE improves performance by 2.1x over the MI6 baseline for a set of user and OS interactive applications. Moreover, IRONHIDE improves performance by 20% over an SGX-like baseline, while also ensuring strong isolation guarantees against microarchitecture state attacks.

CRJun 12, 2017
Revisiting Definitional Foundations of Oblivious RAM for Secure Processor Implementations

Syed Kamran Haider, Omer Khan, Marten van Dijk

Oblivious RAM (ORAM) is a renowned technique to hide the access patterns of an application to an untrusted memory. According to the standard ORAM definition presented by Goldreich and Ostrovsky, two ORAM access sequences must be computationally indistinguishable if the lengths of these sequences are identically distributed. An artifact of this definition is that it does not apply to modern ORAM implementations adapted in current secure processors technology because of their arbitrary lengths of memory access sequences depending on programs' behaviors (their termination times). As a result, the ORAM definition does not directly apply; the theoretical foundations of ORAM do not clearly argue about the timing and termination channels. This paper conducts a first rigorous study of the standard Goldreich-Ostrovsky ORAM definition in view of modern practical ORAMs (e.g., Path ORAM) and demonstrates the gap between theoretical foundations and real implementations. A new ORAM formulation which clearly separates out termination channel leakage is proposed. It is shown how this definition implies the standard ORAM definition (for finite length input access sequences) and better fits the modern practical ORAM implementations. The proposed definition relaxes the constraints around the stash size and overflow probability for Path ORAM, and essentially transforms its security argument into a performance consideration problem. Finally, a `strong' ORAM formulation which clearly includes obfuscation of termination leakage is shown to imply our new ORAM formulation and applies to ORAM for outsourced disk storage. In this strong formulation constraints are not relaxed and the security argument for Path ORAM remains complex as one needs to prove that the stash overflows with negligible probability.