Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
Flash Communication addresses slow inference caused by communication overhead in distributed LLM deployments, with a technique aimed specifically at tensor-parallel serving.
The paper tackles the communication bottleneck in tensor-parallel inference for large language models by introducing Flash Communication, a low-bit compression technique that boosts intra-node communication speed by over 3x and reduces time-to-first-token by 2x with minimal accuracy loss.
The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
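To make the idea of low-bit compression for tensor-parallel communication concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the abstract does not specify the quantization scheme, so the symmetric per-group INT8 scheme, the group size, and the function names (quantize, dequantize, compressed_all_reduce) are all assumptions chosen for illustration. The all-reduce is simulated in a single process by summing the dequantized partial results of each rank.

```python
def quantize(values, bits=8, group_size=4):
    """Symmetric per-group quantization (an assumed scheme, for illustration).

    Returns integer codes plus one floating-point scale per group.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # An all-zero group gets a dummy scale to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the representable range.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales


def dequantize(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def compressed_all_reduce(shards, bits=8, group_size=4):
    """Simulate a sum all-reduce in which every rank sends low-bit codes.

    `shards` holds one list of partial results per rank. A real system
    would exchange the codes and scales over the interconnect; here the
    exchange is only modeled by quantizing and dequantizing each shard.
    """
    total = [0.0] * len(shards[0])
    for shard in shards:
        codes, scales = quantize(shard, bits, group_size)
        restored = dequantize(codes, scales, group_size)
        total = [t + r for t, r in zip(total, restored)]
    return total
```

In this sketch each rank transmits 8-bit codes and one scale per group in place of 16-bit activations, which roughly halves the payload; the price is a rounding error of at most half a quantization step per element per rank.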