Benjamin Brock

DC
h-index5
3papers
12citations
Novelty47%
AI Score42

3 Papers

DCMay 13
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

Chen Zhuang, Lingqi Zhang, Benjamin Brock et al.

Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in substantial communication overhead, which limits both performance and scalability. In this paper, we identify two key sources of communication inefficiency in distributed SpMM: redundant data transfer due to sparsity unawareness, and suboptimal utilization of hierarchical network topology. To address these, we propose (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that maps the sparsity-aware strategy onto two-tier GPU network architectures, minimizing redundant data movement across slower inter-node links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that \method{} demonstrates strong scalability up to 128 GPUs, achieving geometric mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ in SpMM over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at this scale.

DCOct 10, 2025Code
Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

Benjamin Brock, Renato Golin

Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings including 1D, 2D, 1.5D, and 2.5D algorithms. A limitation of current work is that existing algorithms are limited to a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.

CLMar 9, 2024
UniSparse: An Intermediate Language for General Sparse Format Customization

Jie Liu, Zhongyuan Zhao, Zijian Ding et al.

The ongoing trend of hardware specialization has led to a growing use of custom data formats when processing sparse workloads, which are typically memory-bound. These formats facilitate optimized software/hardware implementations by utilizing sparsity pattern- or target-aware data structures and layouts to enhance memory access latency and bandwidth utilization. However, existing sparse tensor programming models and compilers offer little or no support for productively customizing the sparse formats. Additionally, because these frameworks represent formats using a limited set of per-dimension attributes, they lack the flexibility to accommodate numerous new variations of custom sparse data structures and layouts. To overcome this deficiency, we propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats. Unlike the existing attribute-based frameworks, UniSparse decouples the logical representation of the sparse tensor (i.e., the data structure) from its low-level memory layout, enabling the customization of both. As a result, a rich set of format customizations can be succinctly expressed in a small set of well-defined query, mutation, and layout primitives. We also develop a compiler leveraging the MLIR infrastructure, which supports adaptive customization of formats, and automatic code generation of format conversion and compute operations for heterogeneous architectures. We demonstrate the efficacy of our approach through experiments running commonly-used sparse linear algebra operations with specialized formats on multiple different hardware targets, including an Intel CPU, an NVIDIA GPU, an AMD Xilinx FPGA, and a simulated processing-in-memory (PIM) device.