DC, LG | Jan 15, 2024

TP-Aware Dequantization

arXiv:2402.04925v1 | h-index: 36
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in deploying large language models efficiently for real-world applications, though the advance appears incremental because it builds on existing quantization and Tensor Parallel methods.

The paper tackles the problem of high inference latency in distributed deployment of Large Language Models by optimizing quantization kernels for Tensor Parallel settings, achieving speedups of up to 1.81x for Llama-70B and 1.78x for Granite-20B on NVIDIA systems.

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate a speedup of up to 1.81x over existing methods for Llama-70B and up to 1.78x for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX Systems for a variety of TP settings.
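The abstract does not spell out the mechanism, so the following is only an illustrative sketch of how a priori knowledge of the TP layout can remove a communication step. It assumes (this is not stated in the abstract) that the quantized second MLP layer stores its rows in a permuted order `perm`, as activation-ordered quantization schemes do, and that the first layer is column-sharded while the second is row-sharded. Under that assumption, permuting the columns of the first layer offline by the same `perm` lets each rank produce exactly the activations its local shard of the second layer needs, so no gather or reorder is required between the two layers. All names (`tp_aware_mlp`, `perm`, `take_cols`, and so on) are hypothetical, the "ranks" are simulated with a loop, and plain integer matrices stand in for dequantized weights.

```python
import random

def matmul(a, b):
    # Plain-Python matrix product on lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    # Elementwise activation; it commutes with any column permutation.
    return [[max(0, v) for v in row] for row in m]

def take_cols(m, idx):
    return [[row[j] for j in idx] for row in m]

def take_rows(m, idx):
    return [m[i] for i in idx]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rng, rows, cols):
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

def tp_aware_mlp(x, w1, w2, perm, tp):
    """Simulate a column-parallel then row-parallel MLP over `tp` ranks.

    Assumption: the second layer is stored with rows reordered by `perm`.
    The first layer's columns are reordered offline by the same `perm`,
    so every rank's intermediate activations are already aligned locally.
    """
    shard = len(perm) // tp
    w1_reordered = take_cols(w1, perm)  # done once, offline
    w2_reordered = take_rows(w2, perm)  # storage order assumed from quantization
    out = None
    for rank in range(tp):
        local = range(rank * shard, (rank + 1) * shard)
        w1_local = take_cols(w1_reordered, local)
        w2_local = take_rows(w2_reordered, local)
        y1_local = relu(matmul(x, w1_local))   # no cross-rank gather needed here
        partial = matmul(y1_local, w2_local)
        # Summing the partial results stands in for the final all-reduce.
        out = partial if out is None else add(out, partial)
    return out

rng = random.Random(0)
m, k, n, tp = 2, 4, 8, 2
x = rand_matrix(rng, m, k)
w1 = rand_matrix(rng, k, n)
w2 = rand_matrix(rng, n, k)
perm = list(range(n))
rng.shuffle(perm)

# Unsharded, unpermuted reference computation.
reference = matmul(relu(matmul(x, w1)), w2)
result = tp_aware_mlp(x, w1, w2, perm, tp)
assert result == reference
```

The sketch works because summing over a permutation of the hidden dimension visits every index exactly once, so the sharded, reordered computation matches the reference. It says nothing about the actual kernels or the memory-access optimizations that the reported speedups depend on.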

