Jack Kosaian

DC
6papers
117citations
Novelty62%
AI Score48

6 Papers

LGDec 15, 2022
A Study on the Intersection of GPU Utilization and CNN Inference

Jack Kosaian, Amar Phanishayee

There has been significant progress in developing neural network architectures that both achieve high predictive performance and that also achieve high application-level inference throughput (e.g., frames per second). Another metric of increasing importance is GPU utilization during inference: the measurement of how well a deployed neural network uses the computational capabilities of the GPU on which it runs. Achieving high GPU utilization is critical to increasing application-level throughput and ensuring a good return on investment for deploying GPUs. This paper analyzes the GPU utilization of convolutional neural network (CNN) inference. We first survey the GPU utilization of CNNs to show that there is room to improve the GPU utilization of many of these CNNs. We then investigate the GPU utilization of networks within a neural architecture search (NAS) search space, and explore how using GPU utilization as a metric could potentially be used to accelerate NAS itself. Our study makes the case that there is room to improve the inference-time GPU utilization of CNNs and that knowledge of GPU utilization has the potential to benefit even applications that do not target utilization itself. We hope that the results of this study will spur future innovation in designing GPU-efficient neural networks.

59.3DCMay 15
Exceeding the Numerical and Performance Characteristics of IEEE-754 SGEMM with BFloat16 Tensor Cores on GPUs for Scientific Computing

Harun Bayraktar, Cole Brower, John Gunnels et al.

Largely due to their increased native capacity for numerical intensity and power efficiency, reduced-precision floating-point computing resources, primarily used in artificial intelligence (AI) applications, have expanded at a greater rate than their higher-precision relatives. This has led to various efforts focused upon leveraging plentiful reduced-precision hardware to mimic higher-precision mathematical calculations. This paper studies a specific use case, namely the use of bfloat16 (BF16) Tensor Cores found on modern GPUs in service of single precision (FP32) matrix multiply operations. Given that BF16 and FP32 share the same dynamic range, the option to accumulate BF16 operations into FP32 accumulators (at full-speed), and additional BF16 arithmetic characteristics specific to the Blackwell GPU architecture, such as integrated scaling hardware, such emulation is highly motivated. This paper examines the performance, efficiency, power, and numerical characteristics of FP32 matrix multiplication via BF16-based emulation and demonstrates how it exceeds numerical and performance characteristics of native FP32 for scientific applications. We also discuss a full library-ready implementation that correctly deals with denormals.

LGApr 5, 2021Code
ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding

Kaige Liu, Jack Kosaian, K. V. Rashmi

Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Server failures are common in such large distributed systems and must be mitigated to enable training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow in size. This calls for rethinking fault tolerance in DLRM training. We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode, correctly and efficiently updates parities, and enables training to proceed without any pauses, while maintaining consistency of the recovered parameters. We implement ECRM atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRM reduces training-time overhead for large DLRMs by up to 88%, recovers from failures up to 10.3$\times$ faster, and allows training to proceed during recovery. These results show the promise of erasure coding in imparting efficient fault tolerance to training current and future DLRMs.

DCMay 2, 2019Code
Parity Models: A General Framework for Coding-Based Resilience in ML Inference

Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman

Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches to reducing tail latency are inadequate for the latency targets of prediction serving, incur high resource overhead, or are inapplicable to the computations performed during inference. We present ParM, a novel, general framework for making use of ideas from erasure coding and machine learning to achieve low-latency, resource-efficient resilience to slowdowns and failures in prediction serving systems. ParM encodes multiple queries together into a single parity query and performs inference on the parity query using a parity model. A decoder uses the output of a parity model to reconstruct approximations of unavailable predictions. ParM uses neural networks to learn parity models that enable simple, fast encoders and decoders to reconstruct unavailable predictions for a variety of inference tasks such as image classification, speech recognition, and object localization. We build ParM atop an open-source prediction serving system and through extensive evaluation show that ParM improves overall accuracy in the face of unavailability with low latency while using 2-4$\times$ less additional resources than replication-based approaches. ParM reduces the gap between 99.9th percentile and median latency by up to $3.5\times$ compared to approaches that use an equal amount of resources, while maintaining the same median.

DCApr 19, 2021
Arithmetic-Intensity-Guided Fault Tolerance for Neural Network Inference on GPUs

Jack Kosaian, K. V. Rashmi

Neural networks (NNs) are increasingly employed in safety-critical domains and in environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is critical to impart fault tolerance to NN inference. Algorithm-based fault tolerance (ABFT) is emerging as an efficient approach for fault tolerance in NNs. We propose an adaptive approach to ABFT for NN inference that exploits untapped opportunities in emerging deployment scenarios. GPUs have high compute-to-memory-bandwidth ratios, while NN layers have a wide range of arithmetic intensities. This leaves some layers compute bound and others memory-bandwidth bound, but current approaches to ABFT do not consider these differences. We first investigate ABFT schemes best suited for each of these scenarios. We then propose intensity-guided ABFT, an adaptive, arithmetic-intensity-guided approach that selects the most efficient ABFT scheme for each NN layer. Intensity-guided ABFT reduces execution-time overhead by 1.09--5.3$\times$ across many NNs compared to traditional approaches to ABFT.

LGJun 4, 2018
Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation

Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman

Machine learning algorithms are typically run on large scale, distributed compute infrastructure that routinely face a number of unavailabilities such as failures and temporary slowdowns. Adding redundant computations using coding-theoretic tools called "codes" is an emerging technique to alleviate the adverse effects of such unavailabilities. A code consists of an encoding function that proactively introduces redundant computation and a decoding function that reconstructs unavailable outputs using the available ones. Past work focuses on using codes to provide resilience for linear computations and specific iterative optimization algorithms. However, computations performed for a variety of applications including inference on state-of-the-art machine learning algorithms, such as neural networks, typically fall outside this realm. In this paper, we propose taking a learning-based approach to designing codes that can handle non-linear computations. We present carefully designed neural network architectures and a training methodology for learning encoding and decoding functions that produce approximate reconstructions of unavailable computation results. We present extensive experimental results demonstrating the effectiveness of the proposed approach: we show that the our learned codes can accurately reconstruct $64 - 98\%$ of the unavailable predictions from neural-network based image classifiers on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. To the best of our knowledge, this work proposes the first learning-based approach for designing codes, and also presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation. Our results show that learning can be an effective technique for designing codes, and that learned codes are a highly promising approach for bringing the benefits of coding to non-linear computations.