DC · Mar 27

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

Xinwei Qiang, Yue Guan, Zhengding Hu, Yufei Ding, Adnan Aziz

arXiv:2601.20595 · 75.9 · h-index: 3

AI Analysis

The paper addresses communication inefficiencies in distributed AI systems with a compiler-based approach to improving performance.

The paper tackles the bottleneck of communication in large-scale GPU workloads by introducing Syncopate, a compiler and runtime that enables automatic fine-grained compute-communication overlap inside a single fused kernel, achieving an average end-to-end speedup of 1.3× and up to 4.7× on multi-GPU workloads.

Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronizations at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present Syncopate, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, Syncopate performs transformations to align computation with chunk availability. Implemented as a source-to-source compiler on Triton, Syncopate delivers an average end-to-end speedup of 1.3× and up to 4.7× on multi-GPU workloads.
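To make the idea of aligning computation with chunk availability concrete, here is a toy cost model in plain Python. It is only an illustrative sketch and not Syncopate's API or algorithm: the function names, the per-chunk timings, and the assumption that chunks arrive back-to-back on a single link are all hypothetical. The baseline is a fully unoverlapped schedule, which is simpler than the stream-level overlap the abstract describes.

```python
from typing import Sequence


def unoverlapped_time(comm: Sequence[float], compute: Sequence[float]) -> float:
    """Hypothetical baseline: all communication finishes before compute starts."""
    return sum(comm) + sum(compute)


def chunk_overlap_time(comm: Sequence[float], compute: Sequence[float]) -> float:
    """Hypothetical chunk-level pipeline.

    Compute on chunk i may start only once chunk i has arrived and
    the compute on chunk i-1 has finished.
    """
    arrived = 0.0  # time at which the current chunk has fully arrived
    done = 0.0     # time at which compute on the previous chunk finished
    for comm_cost, compute_cost in zip(comm, compute):
        arrived += comm_cost                      # assumed: chunks arrive back-to-back
        done = max(done, arrived) + compute_cost  # wait for the data, then compute
    return done


# Made-up timings (arbitrary units) for four equal chunks.
comm = [1.0, 1.0, 1.0, 1.0]
compute = [1.0, 1.0, 1.0, 1.0]

baseline = unoverlapped_time(comm, compute)
pipelined = chunk_overlap_time(comm, compute)
print(f"unoverlapped: {baseline}, chunk-overlapped: {pipelined}")
```

In this model, only the first chunk's transfer and the last chunk's compute remain exposed when the two costs are balanced; the figures say nothing about the speedups reported in the paper.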

