DC AI LG Feb 28, 2025

SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

arXiv:2502.20727v4, 7 citations, h-index: 5, ICML
Originality: Incremental advance
AI Analysis

The paper addresses scalability and latency problems in distributed inference for large language models. It is an incremental optimization rather than a new paradigm.

The paper tackles communication overhead in tensor parallelism for large language model inference by introducing Sync-Point Drop (SPD), which selectively drops synchronization on attention outputs, resulting in about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B over 8 GPUs.

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on how sensitive model accuracy is to each block. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD reduced overall inference latency by about 20% with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
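To make the idea of dropping a sync point concrete, here is a minimal pure-Python sketch. It is not the paper's block design. It assumes a toy block in which each device holds one linear "attention" shard and one MLP shard, and `all_reduce` is a hypothetical stand-in that sums partial results and counts how often it is called. In the baseline, the block synchronizes twice (after attention and after the MLP). In the SPD-style variant, each device feeds only its own partial attention output into its MLP shard and defers the attention sum to the single remaining sync, so the residual path stays exact while the MLP input becomes an approximation.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def all_reduce(partials, counter):
    """Hypothetical stand-in for a collective all-reduce: sum per-device partials."""
    counter["syncs"] += 1
    return add(*partials)


def mlp_shard(w_in, w_out, h):
    """One device's slice of the MLP (tanh is an arbitrary toy nonlinearity)."""
    hidden = [math.tanh(v) for v in matvec(w_in, h)]
    return matvec(w_out, hidden)


def block_baseline(x, attn_shards, mlp_shards, counter):
    # Sync point 1: combine the partial attention outputs.
    attn = all_reduce([matvec(w, x) for w in attn_shards], counter)
    h = add(x, attn)
    # Sync point 2: combine the partial MLP outputs.
    mlp = all_reduce([mlp_shard(w_in, w_out, h) for w_in, w_out in mlp_shards], counter)
    return add(h, mlp)


def block_spd(x, attn_shards, mlp_shards, counter):
    partials = []
    for w_attn, (w_in, w_out) in zip(attn_shards, mlp_shards):
        a_i = matvec(w_attn, x)         # local partial attention output
        h_i = add(x, a_i)               # residual uses the local partial only (approximation)
        m_i = mlp_shard(w_in, w_out, h_i)
        partials.append(add(a_i, m_i))  # attention partial is deferred to the one sync
    # Single sync point: sum attention and MLP partials together, then add the residual.
    return add(x, all_reduce(partials, counter))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, d_ff, n_dev = 4, 8, 2  # toy sizes, chosen only for illustration
x = [rng.uniform(-1, 1) for _ in range(d)]
attn_shards = [random_matrix(d, d, rng) for _ in range(n_dev)]
mlp_shards = [
    (random_matrix(d_ff // n_dev, d, rng), random_matrix(d, d_ff // n_dev, rng))
    for _ in range(n_dev)
]

base_syncs, spd_syncs = {"syncs": 0}, {"syncs": 0}
y_base = block_baseline(x, attn_shards, mlp_shards, base_syncs)
y_spd = block_spd(x, attn_shards, mlp_shards, spd_syncs)
max_err = max(abs(a - b) for a, b in zip(y_base, y_spd))
print("baseline syncs:", base_syncs["syncs"], "SPD syncs:", spd_syncs["syncs"])
print("max abs difference:", max_err)
```

The sketch halves the number of sync points per block at the cost of a small output difference, which mirrors the trade-off the abstract describes. Which blocks tolerate the drop in a real model is what the paper's sensitivity-based strategy is meant to decide, and that step is not modeled here.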

