DC · Apr 13

fabric-lib: RDMA Point-to-Point Communication for LLM Systems

arXiv:2510.27656 · 55.38 citations · h-index: 6 · Has Code
Predicted impact: top 25% in DC · last 90 days
Originality: Incremental advance
AI Analysis

fabric-lib addresses NIC lock-in for LLM systems that need flexible point-to-point communication, making them portable across hardware providers.

fabric-lib provides a uniform RDMA point-to-point communication interface across different NICs, achieving 400 Gbps peak throughput on both NVIDIA ConnectX-7 and AWS EFA, and enabling production use cases like disaggregated inference, RL weight updates (1.3s for trillion-parameter models), and MoE dispatch/combine with improved latency.

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present fabric-lib, which bridges the functionality of common NICs to expose a uniform interface. fabric-lib exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without relying on ordering guarantees from the network transport, and transparently manages multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase fabric-lib through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) an MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in. fabric-lib is open-sourced at https://github.com/perplexityai/pplx-garden/
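
The abstract's idea of counting immediate values to detect completion, rather than relying on in-order delivery, can be illustrated with a small simulation. This is a minimal sketch and not fabric-lib's actual API: the `ImmCounter` class, its methods `on_write_imm` and `is_complete`, and the `simulate_transfer` helper are hypothetical names invented for illustration, and the "network" is just a shuffled Python list.

```python
import random
from collections import defaultdict


class ImmCounter:
    """Toy receiver-side counter keyed by immediate value (hypothetical, not fabric-lib's API)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def on_write_imm(self, imm: int) -> None:
        # Called once per arrived write that carries immediate value `imm`.
        self._counts[imm] += 1

    def is_complete(self, imm: int, expected: int) -> bool:
        # The transfer is done when the expected number of writes has arrived,
        # regardless of the order in which they arrived.
        return self._counts[imm] >= expected


def simulate_transfer(num_chunks: int, imm: int, seed: int = 0) -> bool:
    """Deliver chunks in a shuffled order and report whether completion is detected."""
    arrival_order = list(range(num_chunks))
    random.Random(seed).shuffle(arrival_order)  # model an unordered transport

    counter = ImmCounter()
    recv_buffer = [None] * num_chunks
    for arrived, chunk in enumerate(arrival_order, start=1):
        recv_buffer[chunk] = f"payload-{chunk}"  # one-sided write lands at its own offset
        counter.on_write_imm(imm)
        if arrived < num_chunks:
            # Completion must not be signalled before every chunk has landed.
            assert not counter.is_complete(imm, num_chunks)

    return counter.is_complete(imm, num_chunks) and all(
        slot is not None for slot in recv_buffer
    )


if __name__ == "__main__":
    print(simulate_transfer(num_chunks=8, imm=7))
```

Because the receiver only compares a count against an expected total, the same logic holds whether writes arrive in order or not, which is the property that matters on transports that do not guarantee ordering.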

