DC AR LG NI

Dec 18, 2023

ACCL+: an FPGA-Based Collective Engine for Distributed Applications

Zhenhao He, Dario Korolija, Yu Zhu, Benjamin Ramhorst, Tristan Laan, Lucian Petrica, Michaela Blott, Gustavo Alonso

arXiv:2312.11742v15.912 citations, h-index: 10, Has Code, OSDI

Originality: Incremental advance

AI Analysis

ACCL+ addresses the cumbersome infrastructure that developers of distributed FPGA applications face, offering a versatile and portable solution. The advance is incremental, but impactful for specific domains such as deep learning inference.

The paper tackles the challenge of developing distributed FPGA-accelerated applications by proposing ACCL+, an open-source FPGA-based collective communication library that enables direct FPGA-to-FPGA communication and offloads networking tasks from the CPU. Evaluated on an FPGA cluster with 100 Gb/s networking against software MPI running over RDMA, it shows significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications.

FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
