DC Dec 10, 2025
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
Chiheng Lou, Sheng Qi, Rui Kang et al.
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analyses of real-world traces have shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5$\times$ more requests compared to the GPU-sharing system.
DC Mar 30
Varuna: Enabling Failure-Type Aware RDMA Failover
Xiaoyang Wang, Yongkun Li, Lulu Yao et al.
RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters assign each primary link a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests can be split into pre-failure and post-failure categories depending on whether the responder has already executed them. Retransmitting post-failure requests is not only redundant (consuming bandwidth), but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and microsecond-level failover. Varuna piggybacks a lightweight completion log on every RDMA operation; after a link failure, this log deterministically reveals which in-flight requests were executed (post-failure) and which were lost (pre-failure). Varuna then retransmits only the pre-failure subset and fetches the return values for post-failure requests. Evaluated using synthetic microbenchmarks and end-to-end RDMA TPC-C transactions, Varuna incurs only 0.6-10% steady-state latency overhead in realistic applications, eliminates 65% of recovery retransmission time, preserves transactional consistency, and introduces zero connectivity rebuild overhead and negligible memory overhead during RDMA failover.
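The pre-failure/post-failure split described in the abstract can be illustrated with a minimal Python sketch. It is not Varuna's implementation: the `Request` type, the `plan_recovery` function, and the representation of the completion log as a dictionary from request ID to return value are all assumptions made for illustration, and no real RDMA API is involved.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    """Hypothetical in-flight RDMA request (illustrative only)."""
    req_id: int
    op: str  # e.g. "WRITE" or "FETCH_ADD" (non-idempotent)


def plan_recovery(
    in_flight: Iterable[Request],
    completion_log: Dict[int, Optional[int]],
) -> Tuple[List[Request], Dict[int, Optional[int]]]:
    """Split in-flight requests using a completion log.

    Assumption: completion_log maps the ID of every request the responder
    has executed to its return value (None if there is none).
    """
    retransmit: List[Request] = []             # pre-failure: never executed
    recovered: Dict[int, Optional[int]] = {}   # post-failure: already executed
    for req in in_flight:
        if req.req_id in completion_log:
            # Executed before the link failed: do not replay it,
            # just recover its return value from the log.
            recovered[req.req_id] = completion_log[req.req_id]
        else:
            # Lost before execution: safe to resend on the standby link.
            retransmit.append(req)
    return retransmit, recovered


in_flight = [Request(1, "WRITE"), Request(2, "FETCH_ADD"), Request(3, "WRITE")]
completion_log = {2: 41}  # only request 2 reached the responder
retransmit, recovered = plan_recovery(in_flight, completion_log)
```

In this toy run, the non-idempotent `FETCH_ADD` is not replayed, which is the correctness concern the abstract raises about blanket retransmission.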
NI Jul 27, 2018
An experiment in distributed Internet address management using blockchains
Stefano Angieri, Alberto García-Martínez, Bingyang Liu et al.
The current system for managing the global pool of IP addresses is centralized in five transnational organizations, the Regional Internet Registries (RIRs). Each of these RIRs manages the address pool for a large number of countries. Because the RIRs are private organizations, they are subject to the legal framework of the country where they are based. This configuration results in a jurisdictional overflow from the legal framework of the country where an RIR is based to all the countries that the RIR serves (the countries served by an RIR de facto become subject to the legal system of the country where the RIR is hosted). The situation is aggravated by the deployment of new security techniques such as the RPKI and BGPsec, which enable enforcement of allocations by the RIRs. In this paper we present InBlock, a blockchain-based distributed governance body aimed at providing decentralized management of IP addresses. InBlock also aims to fulfil the same objectives as the current IP address allocation system, namely, uniqueness, fairness, conservation, aggregation, registration and minimized overhead. InBlock is implemented as a Decentralized Autonomous Organization, i.e., as a set of blockchain smart contracts in Ethereum. Any entity may request an allocation of addresses from the InBlock registry solely by performing a (crypto)currency transfer to InBlock. The fee required, along with the annual renewal fee, serves as a mechanism to deter stockpiling and other wasteful practices. As with any novel technology, there are many open questions about the usage of blockchains to build an IP address registry. For this reason, we believe that practical experimentation is required in order to gain hands-on experience with such a system. We propose to conduct an experiment on distributed address management using InBlock as a starting point to inform future directions in this area.
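As a rough illustration of how a fee-gated registry with annual renewal could enforce uniqueness and discourage stockpiling, the following Python sketch models the idea off-chain. It is not InBlock's smart-contract code: `ToyRegistry`, its methods, the fee amounts, and the one-year lease length are hypothetical, and the prefixes come from the IPv6 documentation range.

```python
import ipaddress
from dataclasses import dataclass
from typing import List

YEAR = 365 * 24 * 3600  # assumed lease length in seconds


@dataclass
class Allocation:
    holder: str
    prefix: ipaddress.IPv6Network
    expires_at: int


class ToyRegistry:
    """Hypothetical off-chain model of a fee-gated prefix registry."""

    def __init__(self, allocation_fee: int, renewal_fee: int) -> None:
        self.allocation_fee = allocation_fee
        self.renewal_fee = renewal_fee
        self.allocations: List[Allocation] = []

    def _drop_expired(self, now: int) -> None:
        # Unrenewed prefixes return to the pool (conservation).
        self.allocations = [a for a in self.allocations if a.expires_at > now]

    def allocate(self, holder: str, prefix: str, payment: int, now: int) -> Allocation:
        net = ipaddress.IPv6Network(prefix)
        if payment < self.allocation_fee:
            raise ValueError("payment below allocation fee")
        self._drop_expired(now)
        # Uniqueness: reject any prefix overlapping an active allocation.
        if any(a.prefix.overlaps(net) for a in self.allocations):
            raise ValueError("prefix overlaps an active allocation")
        alloc = Allocation(holder, net, now + YEAR)
        self.allocations.append(alloc)
        return alloc

    def renew(self, holder: str, prefix: str, payment: int, now: int) -> Allocation:
        net = ipaddress.IPv6Network(prefix)
        if payment < self.renewal_fee:
            raise ValueError("payment below renewal fee")
        self._drop_expired(now)
        for a in self.allocations:
            if a.holder == holder and a.prefix == net:
                a.expires_at += YEAR  # extend the lease by one year
                return a
        raise ValueError("no active allocation to renew")


registry = ToyRegistry(allocation_fee=10, renewal_fee=2)
first = registry.allocate("alice", "2001:db8::/48", payment=10, now=0)
```

Because every held prefix costs a recurring fee and lapses when the fee is not paid, holding unused address space carries an ongoing cost in this model, which is the deterrent the abstract describes.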