Weifeng Gao

h-index: 5
2 papers

DC, Jun 15, 2025
Serving Large Language Models on Huawei CloudMatrix384

Pengfei Zuo, Huimin Lin, Junbo Deng et al.

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.

NI, Oct 10, 2017
Link Quality Aware Channel Allocation for Multichannel Body Sensor Networks

Weifeng Gao, Zhiwei Zhao, Geyong Min et al.

A Body Sensor Network (BSN) is a typical Internet-of-Things (IoT) application for personalized health care. It consists of economically powered, wireless, and implanted medical monitoring sensor nodes designed to continually collect the medical information of target patients. Multichannel communication is often used in BSNs to reduce spectrum competition among the large number of sensor nodes, and the channel assignment problem has attracted much research attention. Health sensing data in BSNs must often be delivered to a sink node (or server) before a certain deadline for real-time monitoring or health emergency alarms. Deadlines are therefore of significant importance for multichannel allocation and scheduling. Existing works, though designed to meet the deadline, often overlook the impact of unreliable wireless links. As a result, health sensing data can still be overdue because transmissions are scheduled on lossy links. In addition, potential collisions in the schedules incur considerable delay in delivering the sensing data. In this paper, we propose a novel deadline-driven Link quality Aware Channel Assignment scheme (LACA), in which link quality, deadlines, and collisions are jointly considered. LACA prioritizes links with urgent deadlines and heavy collisions. LACA also allows the exploitation of spare slots for retransmissions on lossy links, which can further reduce the retransmission delay. Extensive simulation experiments show that, compared to existing approaches, LACA better utilizes the wireless spectrum and achieves a higher packet delivery ratio before the deadline.