Tensor-Parallelism with Partially Synchronized Activations
This paper addresses the communication bottleneck of tensor-parallelism in distributed LLM training and inference, offering a practical improvement for scaling models.
The paper tackles the high communication overhead of tensor-parallelism for Large Language Models (LLMs) by proposing a method that reduces activation synchronization. For 1B and 7B parameter models, it achieves a 50% reduction in tensor-parallel communication with no significant drop in pretraining accuracy.
Training and inference of Large Language Models (LLMs) with tensor-parallelism require substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this approach "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train 1B and 7B parameter CAAT-Net models, with a 50% reduction in tensor-parallel communication and no significant drop in pretraining accuracy. Furthermore, we demonstrate how CAAT-Net accelerates both training and inference workloads.
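To make the idea of partial synchronization concrete, the following is a minimal sketch, not the CAAT-Net implementation. The abstract does not specify how activations are partially synchronized, so the sketch assumes one plausible reading: only a fraction of the hidden channels is summed across devices, while the remaining channels stay device-local. The function names and the `sync_fraction` parameter are hypothetical, and the all-reduce is simulated with plain Python lists rather than a real communication library.

```python
# Illustrative simulation only: no real communication library is used.
# Assumption (not from the abstract): partial synchronization means
# all-reducing only a leading fraction of the hidden channels.

def full_all_reduce(partials):
    """Sum per-device partial activations; every device gets the full sum."""
    hidden = len(partials[0])
    total = [sum(dev[c] for dev in partials) for c in range(hidden)]
    return [list(total) for _ in partials]

def partial_all_reduce(partials, sync_fraction):
    """Hypothetical scheme: sum only the first `sync_fraction` of channels.

    The remaining channels are left as each device's local values.
    """
    hidden = len(partials[0])
    n_sync = int(hidden * sync_fraction)
    synced = [sum(dev[c] for dev in partials) for c in range(n_sync)]
    return [synced + list(dev[n_sync:]) for dev in partials]

def communicated_elements(hidden, sync_fraction):
    """Activation elements per token that enter the all-reduce."""
    return int(hidden * sync_fraction)

# Two simulated devices, hidden size 4, a single token.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
full = full_all_reduce(partials)
half = partial_all_reduce(partials, sync_fraction=0.5)

# Relative saving in all-reduced elements when half the channels are synced.
reduction = 1 - communicated_elements(4, 0.5) / communicated_elements(4, 1.0)
```

Under this assumption, synchronizing half of the channels halves the number of elements entering each all-reduce, which is consistent with the 50% reduction in tensor-parallel communication reported above. After the partial reduction, the devices agree on the synchronized channels and hold different values on the rest.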