LG · Jun 24, 2025

Tensor-Parallelism with Partially Synchronized Activations

arXiv:2506.19645v1 · 4 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses communication bottlenecks in distributed LLM training and inference, offering a practical improvement for scaling models.

The paper tackles the problem of high communication overhead in tensor-parallelism for Large Language Models (LLMs) by proposing a method that reduces activation synchronization, achieving a 50% reduction in tensor-parallel communication with no significant drop in pretraining accuracy for 1B and 7B parameter models.

Training and inference of Large Language Models (LLMs) with tensor-parallelism require substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train 1B and 7B parameter CAAT-Net models, with a 50% reduction in tensor-parallel communication and no significant drop in pretraining accuracy. Furthermore, we demonstrate how CAAT-Net accelerates both training and inference workloads.
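The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way to picture "partially synchronized activations", not the paper's implementation. It assumes that each tensor-parallel device holds a partial activation vector, that only a leading fraction of channels is summed across devices, and that the remaining channels stay device-local. The function name `partial_all_reduce` and the parameter `sync_fraction` are hypothetical, and devices are simulated with plain Python lists.

```python
def partial_all_reduce(partials, sync_fraction):
    """Simulate an all-reduce that synchronizes only a leading slice of channels.

    partials: one list of floats per simulated device (all the same length).
    sync_fraction: share of channels summed across devices (assumed knob).
    """
    hidden = len(partials[0])
    n_sync = int(hidden * sync_fraction)  # channels that would be communicated
    # Sum the synchronized channels across all simulated devices.
    shared = [sum(p[i] for p in partials) for i in range(n_sync)]
    # Each device keeps its own local values for the remaining channels.
    outputs = [shared + list(p[n_sync:]) for p in partials]
    return outputs, n_sync


# Two simulated devices, four channels each.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
outputs, n_sync = partial_all_reduce(partials, sync_fraction=0.5)
full_outputs, n_full = partial_all_reduce(partials, sync_fraction=1.0)
```

Under these assumptions, `sync_fraction=0.5` communicates half as many channels as a full all-reduce (`sync_fraction=1.0`), which matches the scale of the 50% reduction reported above. The devices then disagree on the unsynchronized channels, which is the property the architecture would have to tolerate.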

