39.3DCApr 23
Leveraging SIMD for Accelerating Large-number ArithmeticSubhrajit Das, Abhishek Bichhawat, Yuvraj Patel
Large-number arithmetic, widely used in scientific computing and cryptography, has seen limited adoption of single instruction, multiple data (SIMD) parallelism on modern CPUs due to the inherent dependencies in traditional algorithms. We present DigitsOnTurbo (DoT), which restructures the computation around independent, data-parallel operations, rather than vectorizing the standard algorithms, thereby leveraging the benefits provided by SIMD. Over prior SIMD implementations, DoT achieves up to 1.85x speedups for addition and subtraction, and 2.3x for multiplication. When integrated into state-of-the-art libraries, DoT yields up to 4x speedup for addition and subtraction, and up to 2x speedup for multiplication, cascading into end-to-end throughput gains of up to 19.3% for scientific computations, and up to 7.9% latency and 5.9% throughput improvements on cryptographic implementations.
LGJan 25, 2024
ServerlessLLM: Low-Latency Serverless Inference for Large Language ModelsYao Fu, Leyang Xue, Yeqi Huang et al.
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) \emph{fast multi-tier checkpoint loading}, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) \emph{efficient live migration of LLM inference}, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) \emph{startup-time-optimized model scheduling}, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10 - 200X across various LLM inference workloads.