NI · LG · May 29

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

arXiv:2605.31000 · 91.81 citations · h-index: 16
Predicted impact: top 1% in NI · last 90 days · Originality: Highly original
AI Analysis

This work addresses the critical problem of efficient collective communication for LLM training on heterogeneous hardware, which is a growing concern for organizations with diverse computing resources.

This paper introduces HetCCL, a framework for collective communication when training Large Language Models (LLMs) on mixed-vendor heterogeneous clusters. HetCCL achieves 17-19x higher bandwidth than Gloo in heterogeneous communication and accelerates end-to-end LLM training by up to 16.9% in per-step time.

Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, because hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks designed for homogeneous environments (e.g., NCCL, RCCL) do not support mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur heavy overhead in the data path. This paper presents HetCCL, a framework that enables heterogeneous collective communication through efficient P2P transport across heterogeneous devices (e.g., GPUs), eliminating the host-device memory copy overhead while offloading control to the CPUs. For combining collectives (e.g., AllReduce, ReduceScatter), HetCCL introduces a border-communicator mechanism that achieves vendor independence by reusing the reduction built into the combining collectives of each vendor's collective communication library. Building on the efficient heterogeneous P2P transport and the portable reduction mechanism, HetCCL proposes a hierarchical topology abstraction for heterogeneous clusters that decomposes collective communication into cluster-level primitives, which guarantee optimal cross-cluster data transfer volume and optimal bandwidth utilization. We implement HetCCL with support for four vendors and evaluate it in four heterogeneous settings with benchmarks and end-to-end LLM tasks. Our evaluation shows that HetCCL achieves 17-19x higher bandwidth than Gloo in heterogeneous communication and speeds up end-to-end training by up to 16.9% in per-step time.
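The hierarchical decomposition described in the abstract can be illustrated with a toy simulation. The sketch below is not HetCCL's implementation or API. It is a minimal pure-Python model, assuming a sum AllReduce split into three stages: a reduction inside each vendor cluster, an exchange of per-cluster partial sums across the cluster boundary, and a redistribution of the result. The function names `elementwise_sum` and `hierarchical_allreduce` and the example buffers are hypothetical, and the P2P transport and border-communicator details are not modeled.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum equally sized vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_allreduce(clusters: List[List[Vector]]) -> List[List[Vector]]:
    """Simulate a two-level sum AllReduce (illustrative only).

    clusters[c][r] is the buffer held by rank r of cluster c.
    """
    # Stage 1: reduce inside each cluster. On real hardware this is where
    # a vendor library's own collective would run (assumption).
    partials = [elementwise_sum(ranks) for ranks in clusters]
    # Stage 2: only one partial sum per cluster crosses the cluster boundary.
    total = elementwise_sum(partials)
    # Stage 3: every rank in every cluster receives a copy of the result.
    return [[list(total) for _ in ranks] for ranks in clusters]


# Two hypothetical clusters of different sizes, each rank holding 2 values.
cluster_a = [[1.0, 2.0], [3.0, 4.0]]
cluster_b = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
result = hierarchical_allreduce([cluster_a, cluster_b])
print(result[0][0])  # [94.0, 126.0]
```

In this model the cross-cluster stage moves one buffer per cluster regardless of how many ranks each cluster has, which is the intuition behind keeping cross-cluster transfer volume low.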
