DCMay 5
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUsYixuan Mei, Zikun Li, Zixuan Chen et al.
The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\times$ over the best baseline, and delivers up to 2.39$\times$ higher goodput under scarce resource availability.
ITApr 16
Bandwidth Cost of Locally Repairable Convertible Codes in the Global Merge RegimeSaransh Chopra, Shubhransh Singhvi, K. V. Rashmi
Recent studies have shown that distributed storage systems can achieve significant space savings by adapting redundancy levels to varying disk failure rates. This adaptation is performed via code conversion, wherein data encoded under an initial code are transformed to data encoded under a final code. While this process is typically resource-intensive, convertible codes are designed to enable these transformations efficiently while preserving desirable decodability constraints such as repair degree, or the number of nodes accessed during node repair. In this work, we focus on the bandwidth cost of conversion, or the total amount of data transferred during the conversion process. We study fundamental limits on the bandwidth cost of conversion between systematic optimal-distance Locally Repairable Codes (LRCs). We restrict our focus to the global merge regime, in which multiple initial codewords are combined to form a single final codeword while preserving information locality. We focus on stable convertible codes, wherein the number of unchanged nodes is maximized during conversion. We generalize an information-theoretic approach for modeling code conversion to the LRC setting, and derive the first non-trivial lower bounds on the bandwidth cost of conversion in this regime. Notably, our bounds do not rely on any linearity assumptions. Consequently, we show that the constructions of Maturana and Rashmi are bandwidth-optimal across a broad range of parameters in the global merge regime.
LGApr 5, 2021Code
ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure CodingKaige Liu, Jack Kosaian, K. V. Rashmi
Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Server failures are common in such large distributed systems and must be mitigated to enable training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow in size. This calls for rethinking fault tolerance in DLRM training. We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode, correctly and efficiently updates parities, and enables training to proceed without any pauses, while maintaining consistency of the recovered parameters. We implement ECRM atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRM reduces training-time overhead for large DLRMs by up to 88%, recovers from failures up to 10.3$\times$ faster, and allows training to proceed during recovery. These results show the promise of erasure coding in imparting efficient fault tolerance to training current and future DLRMs.
DCMay 2, 2019Code
Parity Models: A General Framework for Coding-Based Resilience in ML InferenceJack Kosaian, K. V. Rashmi, Shivaram Venkataraman
Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches to reducing tail latency are inadequate for the latency targets of prediction serving, incur high resource overhead, or are inapplicable to the computations performed during inference. We present ParM, a novel, general framework for making use of ideas from erasure coding and machine learning to achieve low-latency, resource-efficient resilience to slowdowns and failures in prediction serving systems. ParM encodes multiple queries together into a single parity query and performs inference on the parity query using a parity model. A decoder uses the output of a parity model to reconstruct approximations of unavailable predictions. ParM uses neural networks to learn parity models that enable simple, fast encoders and decoders to reconstruct unavailable predictions for a variety of inference tasks such as image classification, speech recognition, and object localization. We build ParM atop an open-source prediction serving system and through extensive evaluation show that ParM improves overall accuracy in the face of unavailability with low latency while using 2-4$\times$ less additional resources than replication-based approaches. ParM reduces the gap between 99.9th percentile and median latency by up to $3.5\times$ compared to approaches that use an equal amount of resources, while maintaining the same median.
DCApr 19, 2021
Arithmetic-Intensity-Guided Fault Tolerance for Neural Network Inference on GPUsJack Kosaian, K. V. Rashmi
Neural networks (NNs) are increasingly employed in safety-critical domains and in environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is critical to impart fault tolerance to NN inference. Algorithm-based fault tolerance (ABFT) is emerging as an efficient approach for fault tolerance in NNs. We propose an adaptive approach to ABFT for NN inference that exploits untapped opportunities in emerging deployment scenarios. GPUs have high compute-to-memory-bandwidth ratios, while NN layers have a wide range of arithmetic intensities. This leaves some layers compute bound and others memory-bandwidth bound, but current approaches to ABFT do not consider these differences. We first investigate ABFT schemes best suited for each of these scenarios. We then propose intensity-guided ABFT, an adaptive, arithmetic-intensity-guided approach that selects the most efficient ABFT scheme for each NN layer. Intensity-guided ABFT reduces execution-time overhead by 1.09--5.3$\times$ across many NNs compared to traditional approaches to ABFT.
LGJun 4, 2018
Learning a Code: Machine Learning for Approximate Non-Linear Coded ComputationJack Kosaian, K. V. Rashmi, Shivaram Venkataraman
Machine learning algorithms are typically run on large scale, distributed compute infrastructure that routinely face a number of unavailabilities such as failures and temporary slowdowns. Adding redundant computations using coding-theoretic tools called "codes" is an emerging technique to alleviate the adverse effects of such unavailabilities. A code consists of an encoding function that proactively introduces redundant computation and a decoding function that reconstructs unavailable outputs using the available ones. Past work focuses on using codes to provide resilience for linear computations and specific iterative optimization algorithms. However, computations performed for a variety of applications including inference on state-of-the-art machine learning algorithms, such as neural networks, typically fall outside this realm. In this paper, we propose taking a learning-based approach to designing codes that can handle non-linear computations. We present carefully designed neural network architectures and a training methodology for learning encoding and decoding functions that produce approximate reconstructions of unavailable computation results. We present extensive experimental results demonstrating the effectiveness of the proposed approach: we show that the our learned codes can accurately reconstruct $64 - 98\%$ of the unavailable predictions from neural-network based image classifiers on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. To the best of our knowledge, this work proposes the first learning-based approach for designing codes, and also presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation. Our results show that learning can be an effective technique for designing codes, and that learned codes are a highly promising approach for bringing the benefits of coding to non-linear computations.
ITAug 16, 2015
Information-theoretically Secure Erasure Codes for Distributed StorageNihar B. Shah, K. V. Rashmi, Kannan Ramchandran et al.
Repair operations in distributed storage systems potentially expose the data to malicious acts of passive eavesdroppers or active adversaries, which can be detrimental to the security of the system. This paper presents erasure codes and repair algorithms that ensure security of the data in the presence of passive eavesdroppers and active adversaries, while maintaining high availability, reliability and efficiency in the system. Our codes are optimal in that they meet previously proposed lower bounds on the storage, network-bandwidth, and reliability requirements for a wide range of system parameters. Our results thus establish the capacity of such systems. Our codes for security from active adversaries provide an additional appealing feature of `on-demand security' where the desired level of security can be chosen separately for each instance of repair, and our algorithms remain optimal simultaneously for all possible levels. The paper also provides necessary and sufficient conditions governing the transformation of any (non-secure) code into one providing on-demand security.
LGMay 7, 2015
DART: Dropouts meet Multiple Additive Regression TreesK. V. Rashmi, Ran Gilad-Bachrach
Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice. However, it suffers an issue which we call over-specialization, wherein trees added at later iterations tend to impact the prediction of only a few instances, and make negligible contribution towards the remaining instances. This negatively affects the performance of the model on unseen data, and also makes the model over-sensitive to the contributions of the few, initially added tress. We show that the commonly used tool to address this issue, that of shrinkage, alleviates the problem only to a certain extent and the fundamental issue of over-specialization still remains. In this work, we explore a different approach to address the problem that of employing dropouts, a tool that has been recently proposed in the context of learning deep neural networks. We propose a novel way of employing dropouts in MART, resulting in the DART algorithm. We evaluate DART on ranking, regression and classification tasks, using large scale, publicly available datasets, and show that DART outperforms MART in each of the tasks, with a significant margin. We also show that DART overcomes the issue of over-specialization to a considerable extent.
CRJun 30, 2012
Distributed Secret Dissemination Across a NetworkNihar B. Shah, K. V. Rashmi, Kannan Ramchandran
Shamir's (n, k) threshold secret sharing is an important component of several cryptographic protocols, such as those for secure multiparty-computation and key management. These protocols typically assume the presence of direct communication links from the dealer to all participants, in which case the dealer can directly pass the shares of the secret to each participant. In this paper, we consider the problem of secret sharing when the dealer does not have direct communication links to all the participants, and instead, the dealer and the participants form a general network. Existing methods are based on secure message transmissions from the dealer to each participant requiring considerable coordination in the network. In this paper, we present a distributed algorithm for disseminating shares over a network, which we call the SNEAK algorithm, requiring each node to know only the identities of its one-hop neighbours. While SNEAK imposes a stronger condition on the network by requiring the dealer to be what we call k-propagating rather than k-connected as required by the existing solutions, we show that in addition to being distributed, SNEAK achieves significant reduction in the communication cost and the amount of randomness required.