DC Dec 22, 2025
UCCL-EP: Portable Expert-Parallel Communication
Ziming Mao, Yihan Zhang, Chihan Cui et al.
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On an NVIDIA-only platform, UCCL-EP achieves performance comparable to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
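The control-channel idea in this abstract (compact commands handed to multithreaded CPU proxies that issue the transfers) can be illustrated with a toy, CPU-only simulation. This is a sketch under stated assumptions, not UCCL-EP's implementation: the 16-byte command layout `CMD`, the function `run_proxies`, and the use of a Python queue in place of a GPU-visible ring buffer are all hypothetical, and the RDMA write is only recorded, never performed.

```python
import queue
import struct
import threading

# Hypothetical 16-byte command layout (an assumption, not from the paper):
# destination rank, expert id, buffer offset, byte count.
CMD = struct.Struct("<IIII")


def run_proxies(commands, num_proxies=2):
    """Drain packed commands with CPU proxy threads; return the recorded writes."""
    channel = queue.Queue()   # stands in for the GPU-CPU control channel
    issued = []
    lock = threading.Lock()

    def proxy():
        while True:
            raw = channel.get()
            if raw is None:   # sentinel: no more commands for this proxy
                return
            dst, expert, offset, nbytes = CMD.unpack(raw)
            # A real proxy would post a GPUDirect RDMA write here;
            # this sketch only records what would have been issued.
            with lock:
                issued.append((dst, expert, offset, nbytes))

    threads = [threading.Thread(target=proxy) for _ in range(num_proxies)]
    for t in threads:
        t.start()
    for cmd in commands:      # stands in for the GPU-side producer
        channel.put(CMD.pack(*cmd))
    for _ in threads:         # one sentinel per proxy, queued after all commands
        channel.put(None)
    for t in threads:
        t.join()
    return sorted(issued)


if __name__ == "__main__":
    print(run_proxies([(1, 0, 0, 4096), (0, 3, 4096, 4096)]))
```

The result is sorted because proxies may complete commands in any order; the sketch makes no attempt to model the ordering semantics that the abstract says are emulated with RDMA immediate data.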
NI May 21
EnCoR: An end-to-end architecture for simplifying cellular networks
Wesley Woo, Zhuowei Wen, Monniiesh Velmurugan et al.
Since their creation, cellular networks have made in-network mobility support a key feature of their service model. While this approach provides seamless connectivity for legacy traffic, it has the side effects of inflating end-user latency and increasing complexity and operational overhead for operators. Yet modern applications and transport protocols are increasingly mobility tolerant, prompting us to revisit the assumption that mobility must be provided as an in-network service. In this paper, we propose EnCoR (End-to-End Core and RAN), a deployable cellular network architecture that removes mobility from the core entirely. Leveraging end-to-end mobility, EnCoR eliminates tunnel-based IP anchoring while preserving compatibility with existing authentication, charging, and QoS techniques. We demonstrate that EnCoR works with unmodified phones while providing performance equivalent to traditional LTE networks for real applications including video and voice calling and video streaming. We show that EnCoR not only allows network operators to reduce end-to-end latency, but can also reduce the capital cost of providing low-latency service to users by more than 90% compared to 3GPP networks, based on cost estimates for cellular network core and border router infrastructure provided by the FCC. Finally, we demonstrate that these gains are achieved while reducing the amount of overall handover control messaging, allowing the EnCoR core network to handle a greater number of mobility handover events than an LTE core under identical hardware constraints, achieving a 2.6x lower handover latency under load.
OS May 19
Clove: Object-Level CXL Memory Management in Managed Runtimes
Sam Son, Zhihong Luo, Wen Zhang et al.
Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store interface. As a result, existing approaches remain limited in scope, primarily targeting unmanaged-language applications with bespoke runtimes or compiler support. This paper identifies and explores a new design point for object-level CXL management: managed languages and their runtimes. The key observation is that existing managed runtimes already provide highly optimized mechanisms for problems closely related to object-level management, including object relocation and dynamic code generation. However, they still lack the features needed for tiered memory management, such as hotness tracking and relocation policies, and thus must be carefully extended to fully realize this direction. We present Clove, a system that extends existing managed runtimes to support object-level CXL management for managed-language applications. Clove combines profile-guided object hotness tracking with object relocation techniques and policies. Our JVM prototype demonstrates that this extension enables high utilization of fast-tier memory while bounding runtime overhead, reducing application slowdown by 22-84% compared to page-based systems.
NI Apr 30
Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
Junsun Choi, Sam Son, Sunjin Choi et al.
Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (scale-up, scale-out, 3D torus, and 3D full-mesh). We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6-56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
DC Nov 3, 2024
SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
Ziming Mao, Tian Xia, Zhanghao Wu et al.
Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it timely and challenging to lower service costs and guarantee service quality. While spot instances have long been offered with a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI models. To address this, we propose a simple yet efficient policy, SpotHedge, that leverages spot replicas across different failure domains (e.g., regions and clouds) to ensure availability, lower costs, and high service quality. SpotHedge intelligently spreads spot replicas across different regions and clouds to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We built SkyServe, a system leveraging SpotHedge to efficiently serve AI models over a mixture of spot and on-demand replicas across regions and clouds. We compared SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by 43% on average while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by 2.3$\times$, 2.1$\times$, 2.1$\times$ on average compared to other research and production systems.
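The three ingredients of the policy described in this abstract (spreading spot replicas across failure domains, overprovisioning spot replicas, and falling back to on-demand replicas) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SkyServe's actual policy: the function names, the round-robin placement, and the rule "launch one on-demand replica per missing ready spot replica" are all hypothetical simplifications.

```python
def spread_spot(num_spot, zones):
    """Place spot replicas round-robin across failure domains (hypothetical rule)."""
    if not zones:
        return {}
    placement = {zone: 0 for zone in zones}
    for i in range(num_spot):
        placement[zones[i % len(zones)]] += 1
    return placement


def on_demand_needed(target, ready_spot):
    """Fall back to on-demand replicas for any shortfall in ready spot replicas."""
    return max(0, target - ready_spot)


def plan(target, extra, zones, ready_spot):
    """Return (spot placement, on-demand count) for one scheduling step.

    target:     replicas needed to serve the load
    extra:      overprovisioned spot replicas kept as a safeguard
    zones:      failure domains (regions or clouds) with spot capacity
    ready_spot: spot replicas currently up and serving
    """
    return spread_spot(target + extra, zones), on_demand_needed(target, ready_spot)


if __name__ == "__main__":
    print(plan(target=4, extra=2, zones=["a", "b", "c"], ready_spot=3))
```

With a target of 4, 2 extra spot replicas, and three zones, the sketch requests 2 spot replicas per zone and, because only 3 spot replicas are ready, 1 on-demand replica to cover the gap.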
NI Jul 28, 2025
Load Balancing for AI Training Workloads
Sarah McClure, Sylvia Ratnasamy, Scott Shenker
We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs.
NI Oct 21, 2024
Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving
Alexander Krentsel, Peter Schafhalter, Joseph E. Gonzalez et al.
Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car.
NI Apr 21, 2020
How to Train your DNN: The Network Operator Edition
Michael Alan Chang, Domenic Bottini, Lisa Jian et al.
Deep Neural Nets have hit quite a crest, But physical networks are where they must rest, And here we put them all to the test, To see which network optimization is best.