Wenfei Wu

DC
h-index6
11papers
54citations
Novelty62%
AI Score55

11 Papers

78.6DCMay 18
EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

Yitao Yuan, Jianglong Nie, Tianyu Bai et al.

In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation -- spanning model checking, packet/flow simulations, VM emulation, Tofino Testbed, and FPGA/RTL verification -- confirms EPIC's correctness, performance gain, and feasibility.

62.0DCMay 15
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

Yitao Yuan, Chenqi Zhao, Bohan Zhao et al.

Efficiently harnessing GPU compute is critical to improving user experience and reducing operational costs in large language model (LLM) services. However, current inference engine schedulers overlook the attention backend's sensitivity to request-length heterogeneity within a batch. As state-of-the-art models now support context windows exceeding 128K tokens, this once-tolerable inefficiency has escalated into a primary system bottleneck, causing severe performance degradation through GPU underutilization and increased latency. We present CascadeInfer, a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. CascadeInfer partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them. CascadeInfer devises a dynamic programming algorithm to efficiently find the stage partition with the best QoE, employs runtime range refinement together with decentralized load (re)balance both across and within groups, achieving a balanced and efficient multi-instance service. Our evaluation shows that, under the same configuration, CascadeInfer reduces end-to-end latency by up to 67% and tail latency by up to 69%, while improving overall system throughput by up to 2.89 times compared to the state-of-the-art multi-instance scheduling systems.

68.9LGMay 12
Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

Shaoke Fang, Ziang Li, Wenfei Wu et al.

Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends critically on the eviction policy as GPU memory is scarce, and existing policies such as LRU largely treat cached blocks uniformly. This view ignores a fundamental property of LLM prompts: not all tokens are equally worth caching. We show that different token types within a prompt, including system prompts, user queries, tool outputs, model responses, and chain-of-thought reasoning, exhibit up to 756x variation in reuse rates, yet no existing eviction policy exploits this signal. In this paper, we present SAECache (Semantic-Adaptive Eviction for prefix caches), a semantic-adaptive prefix cache eviction policy that addresses this gap through three innovations: (1) a multi-queue architecture that routes KV blocks to task-specific queues with tailored priority metrics, capturing both session reuse in multi-turn requests and structural reuse in templated single-turn requests; (2) a semantic-aware token weighting mechanism that learns the reuse value of different token types online through eviction feedback; and (3) a fully adaptive online learning schema for all parameter updates, including log-normal timing parameters, position decay power, queue weights, and meta-parameters, which eliminates manual tuning and enables automatic adaptation to deployment-specific workload characteristics. Through extensive evaluation across heterogeneous workloads, we demonstrate that SAECache achieves 1.4x-2.7x TTFT improvement over production-style baselines, while fixed-parameter alternatives can degrade by up to 2.7x under workload mismatch -- a failure mode our adaptive approach avoids entirely.

PLMar 6
Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Mengting He, Shihao Xia, Haomin Jia et al.

The widespread adoption of large language models (LLMs) has made GPU-accelerated inference a critical part of modern computing infrastructure. Production inference systems rely on CUDA kernels to implement core transformer operations, yet these kernels are highly susceptible to memory-safety bugs due to model-dependent tensor layouts, intricate memory indexing, and massive thread-level parallelism. Such bugs can corrupt model weights, crash inference services, or even enable adversarial attacks. Existing techniques either depend on unavailable hardware, incur high overhead, or fail to handle kernel inputs with variable lengths, and none can effectively detect CUDA memory bugs in LLM inference systems. This paper presents Model2Kernel, the first practical system for automatically verifying the memory safety of CUDA kernels used in LLM inference. Model2Kernel performs model-aware dynamic analysis to determine how each model invokes kernels and to classify kernel arguments as either fixed by the model architecture or controlled by model users. Using this information, Model2Kernel then applies CUDA-specialized symbolic execution, supported by new abstractions for dynamic tensor memory and thread identifiers, to accurately pinpoint memory bugs in kernels. In the evaluation on CUDA kernels and models from vLLM, Hugging Face, and recent LLM research papers, Model2Kernel discovers 353 previously unknown bugs while producing only nine false positives, demonstrating its effectiveness.

DCFeb 12, 2024
Accelerating Distributed Deep Learning using Lossless Homomorphic Compression

Haoyu Li, Yuchen Xu, Jiayi Chen et al.

As deep neural networks (DNNs) grow in complexity and size, the resultant increase in communication overhead during distributed training has become a significant bottleneck, challenging the scalability of distributed training systems. Existing solutions, while aiming to mitigate this bottleneck through worker-level compression and in-network aggregation, fall short due to their inability to efficiently reconcile the trade-offs between compression effectiveness and computational overhead, hindering overall performance and scalability. In this paper, we introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation. Our solution is both homomorphic, allowing for efficient in-network aggregation without CPU/GPU processing, and lossless, ensuring no compromise on training accuracy. Theoretically optimal in compression and computational efficiency, our approach is empirically validated across diverse DNN models such as NCF, LSTM, VGG19, and BERT-base, showing up to a 6.33$\times$ improvement in aggregation throughput and a 3.74$\times$ increase in per-iteration training speed.

CRAug 11, 2025
Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

Kexin Chu, Zecheng Lin, Dawei Xiang et al.

Global KV-cache sharing has emerged as a key optimization for accelerating large language model (LLM) inference. However, it exposes a new class of timing side-channel attacks, enabling adversaries to infer sensitive user inputs via shared cache entries. Existing defenses, such as per-user isolation, eliminate leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT), making them impractical for high-throughput deployment. To address this gap, we introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware KV-cache management framework that selectively shares non-sensitive entries while confining sensitive content to private caches. SafeKV comprises three components: (i) a hybrid, multi-tier detection pipeline that integrates rule-based pattern matching, a general-purpose privacy detector, and context-aware validation; (ii) a unified radix-tree index that manages public and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and (iii) entropy-based access monitoring to detect and mitigate residual information leakage. Our evaluation shows that SafeKV mitigates 94% - 97% of timing-based side-channel attacks. Compared to per-user isolation method, SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66X across diverse LLMs and workloads. SafeKV reduces cache-induced TTFT overhead from 50.41% to 11.74% on Qwen3-235B. By combining fine-grained privacy control with high cache reuse efficiency, SafeKV reclaims the performance advantages of global sharing while providing robust runtime privacy guarantees for LLM inference.

55.5DCApr 21
Verifying In-Network Computing Systems for Design Risks

Tianyu Bai, Ying Zhang, Xiaoxi Zhang et al.

The emergence of programmable switches has brought in-network computing (INC) into the spotlight in recent years. By offloading computation directly onto the data transmission process, INC improves network utilization, reduces latency to sub-RTT levels, saves link bandwidth, and maintains throughput. However, INC disrupts the transparency of traditional networks, forcing developers to consider network exceptions like packet loss and out-of-order. If not properly handled, these exceptions can lead to violations of application properties, such as cache consistency and lock exclusion. Usual testing cannot exhaustively cover these exceptions, raising doubts about the correctness of INC systems and hindering their deployment in the industry. This paper presents INCGuard, the first general-purpose tool for verifying INC systems. INCGuard provides a high-level specification language and saves developers 67.2% lines of code on average. To help better understand the behavior of the system, INCGuard offers configurable network environments. INCGuard enables developers to express INC-specific correctness properties. INCGuard translates developer-specified systems into state transition representations, performs model checking to detect potential design risks, and reports violation traces to developers. We propose optimizations for INC-specific scenarios to address the challenge of state space explosion. We modeled seven INC systems and identified their risks with INCGuard in seconds. We further reproduce them in real systems to confirm the validity of our verification result.

81.1NIApr 9
LCMP: Distributed Long-Haul Cost-Aware Multi-Path Routing for Inter-Datacenter RDMA Networks

Dong-Yang Yu, Yuchao Zhang, Xiaodi Wang et al.

RDMA-empowered cloud services are gradually deployed across datacenters (DCs) with multiple paths, which exhibit new properties of path asymmetry, delayed congestion signals, and simultaneous flow routing collisions, and further fail existing routing methods. We present LCMP, a distributed long-haul cost-aware multi-path routing framework that aims to place RDMA flows on multiple inter-DC paths, achieving low-cost, low-latency, and congestion-responsive transmission. LCMP combines a control-plane path-quality score with compact on-switch congestion signals, where the former unifies quality assessment for asymmetric paths and the latter enables responsive reaction to path congestion. LCMP further resolves the simultaneous flow decision collision problem by filtering high-cost candidates, and performing a diversity-preserving hash inside the reduced set. On an 8-DC testbed, LCMP reduces median and tail FCT slowdown by up to 76% and 64%, respectively compared to state-of-the-art (SOTA) DCN routing strategies. And large-scale NS-3 simulations under the 2000 km inter-DC scenario confirm similar improvements.

DCJan 17, 2022
Efficient Data-Plane Memory Scheduling for In-Network Aggregation

Hao Wang, Yuxuan Qin, ChonLam Lao et al.

As the scale of distributed training grows, communication becomes a bottleneck. To accelerate the communication, recent works introduce In-Network Aggregation (INA), which moves the gradients summation into network middle-boxes, e.g., programmable switches to reduce the traffic volume. However, switch memory is scarce compared to the volume of gradients transmitted in distributed training. Although literature applies methods like pool-based streaming or dynamic sharing to tackle the mismatch, switch memory is still a potential performance bottleneck. Furthermore, we observe the under-utilization of switch memory due to the synchronization requirement for aggregator deallocation in recent works. To improve the switch memory utilization, we propose ESA, an $\underline{E}$fficient Switch Memory $\underline{S}$cheduler for In-Network $\underline{A}$ggregation. At its cores, ESA enforces the preemptive aggregator allocation primitive and introduces priority scheduling at the data-plane, which improves the switch memory utilization and average job completion time (JCT). Experiments show that ESA can improve the average JCT by up to $1.35\times$.

OCNov 15, 2021
Simultaneously Achieving Sublinear Regret and Constraint Violations for Online Convex Optimization with Time-varying Constraints

Qingsong Liu, Wenfei Wu, Longbo Huang et al.

In this paper, we develop a novel virtual-queue-based online algorithm for online convex optimization (OCO) problems with long-term and time-varying constraints and conduct a performance analysis with respect to the dynamic regret and constraint violations. We design a new update rule of dual variables and a new way of incorporating time-varying constraint functions into the dual variables. To the best of our knowledge, our algorithm is the first parameter-free algorithm to simultaneously achieve sublinear dynamic regret and constraint violations. Our proposed algorithm also outperforms the state-of-the-art results in many aspects, e.g., our algorithm does not require the Slater condition. Meanwhile, for a group of practical and widely-studied constrained OCO problems in which the variation of consecutive constraints is smooth enough across time, our algorithm achieves $O(1)$ constraint violations. Furthermore, we extend our algorithm and analysis to the case when the time horizon $T$ is unknown. Finally, numerical experiments are conducted to validate the theoretical guarantees of our algorithm, and some applications of our proposed framework will be outlined.

RONov 16, 2020
Indoor Positioning System based on Visible Light Communication for Mobile Robot in Nuclear Power Plant

Hongyun Xie, Linyi Huang, Wenfei Wu

Visible light positioning (VLP) is widely believed to be a cost-effective answer to the growing demanded for robot indoor positioning. Considering that some extreme environments require robot to be equipped with a precise and radiation-resistance indoor positioning system for doing difficult work, a novel VLP system with high accuracy is proposed to realize the long-playing inspection and intervention under radiation environment. The proposed system with sufficient radiation-tolerance is critical for operational inspection, maintenance and intervention tasks in nuclear facilities. Firstly, we designed intelligent LED lamp with visible light communication (VLC) function to dynamically create the indoor GPS tracking system. By installing the proposed lamps that replace standard lighting in key locations in the nuclear power plant, the proposed system can strengthen the safety of mobile robot and help for efficient inspection in the large-scale field. Secondly, in order to enhance the radiation-tolerance and multi-scenario of the proposed system, we proposed a shielding protection method for the camera vertically installed on the robot, which ensures that the image elements of the camera namely the captured VLP information is not affected by radiation. Besides, with the optimized visible light positioning algorithm based on dispersion calibration method, the proposed VLP system can achieve an average positioning accuracy of 0.82cm and ensure that 90% positioning errors are less than 1.417cm. Therefore, the proposed system not only has sufficient radiation-tolerance but achieve state-of-the-art positioning accuracy in the visible light positioning field.