DCMar 15, 2023
MCR-DL: Mix-and-Match Communication Runtime for Deep LearningQuentin Anthony, Ammar Ahmad Awan, Jeff Rasley et al.
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) and Mixture-of-Experts (MoE). Communication libraries' performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs with the Theta-GPU HPC system.
DCAug 19, 2024
Demystifying the Communication Characteristics for Distributed Transformer ModelsQuentin Anthony, Benjamin Michalowicz, Jacob Hatef et al.
Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has been fueled by distributed training, yet distributed communication remains a substantial bottleneck to training progress. This paper examines the communication behavior of transformer models - that is, how different parallelism schemes used in multi-node/multi-GPU DL Training communicate data in the context of transformers. We use GPT-based language models as a case study of the transformer architecture due to their ubiquity. We validate the empirical results obtained from our communication logs using analytical models. At a high level, our analysis reveals a need to optimize small message point-to-point communication further, correlations between sequence length, per-GPU throughput, model size, and optimizations used, and where to potentially guide further optimizations in framework and HPC middleware design and optimization.
32.9DCMar 24
Scaled Block Vecchia Approximation for High-Dimensional Gaussian Process Emulation on GPUsQilong Pan, Sameh Abdulah, Mustafa Abduljabbar et al.
Emulating computationally intensive scientific simulations is crucial for enabling uncertainty quantification, optimization, and informed decision-making at scale. Gaussian Processes (GPs) offer a flexible and data-efficient foundation for statistical emulation, but their poor scalability limits applicability to large datasets. We introduce the Scaled Block Vecchia (SBV) algorithm for distributed GPU-based systems. SBV integrates the Scaled Vecchia approach for anisotropic input scaling with the Block Vecchia (BV) method to reduce computational and memory complexity while leveraging GPU acceleration techniques for efficient linear algebra operations. To the best of our knowledge, this is the first distributed implementation of any Vecchia-based GP variant. Our implementation employs MPI for inter-node parallelism and the MAGMA library for GPU-accelerated batched matrix computations. We demonstrate the scalability and efficiency of the proposed algorithm through experiments on synthetic and real-world workloads, including a 50M point simulation from a respiratory disease model. SBV achieves near-linear scalability on up to 512 A100 and GH200 GPUs, handles 2.56B points, and reduces energy use relative to exact GP solvers, establishing SBV as a scalable and energy-efficient framework for emulating large-scale scientific models on GPU-based distributed systems.
PFFeb 23, 2022
Shisha: Online scheduling of CNN pipelines on heterogeneous architecturesPirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon et al.
Chiplets have become a common methodology in modern chip design. Chiplets improve yield and enable heterogeneity at the level of cores, memory subsystem and the interconnect. Convolutional Neural Networks (CNNs) have high computational, bandwidth and memory capacity requirements owing to the increasingly large amount of weights. Thus to exploit chiplet-based architectures, CNNs must be optimized in terms of scheduling and workload distribution among computing resources. We propose Shisha, an online approach to generate and schedule parallel CNN pipelines on chiplet architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.
LGAug 19, 2012
Adaptive Graph via Multiple Kernel Learning for Nonnegative Matrix FactorizationJing-Yan Wang, Mustafa AbdulJabbar
Nonnegative Matrix Factorization (NMF) has been continuously evolving in several areas like pattern recognition and information retrieval methods. It factorizes a matrix into a product of 2 low-rank non-negative matrices that will define parts-based, and linear representation of nonnegative data. Recently, Graph regularized NMF (GrNMF) is proposed to find a compact representation,which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. In GNMF, an affinity graph is constructed from the original data space to encode the geometrical information. In this paper, we propose a novel idea which engages a Multiple Kernel Learning approach into refining the graph structure that reflects the factorization of the matrix and the new data space. The GrNMF is improved by utilizing the graph refined by the kernel learning, and then a novel kernel learning method is introduced under the GrNMF framework. Our approach shows encouraging results of the proposed algorithm in comparison to the state-of-the-art clustering algorithms like NMF, GrNMF, SVD etc.