Leyang Xue

h-index: 7

Papers: 5

Citations: 142

Novelty: 47%

AI Score: 33

Ranked #119,451 of 194,257 authors (top 61%); #26,268 in LG (top 65%)

5 Papers

5.9 | DC | Mar 12, 2025 | Code

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

Tairan Xu, Leyang Xue, Zhan Lu et al.

This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
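The accumulate-then-launch idea behind module-based batching can be illustrated with a minimal sketch. This is not MoE-Gen's implementation: the `ModuleBatcher` class, its method names, and the batch sizes are hypothetical, and plain Python lists stand in for host-memory token buffers and GPU launches.

```python
from collections import deque


class ModuleBatcher:
    """Hypothetical per-module queue: holds tokens on the host until a large batch is ready."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # assumed to be tuned separately for each module
        self.pending = deque()        # stands in for a host-memory buffer

    def add(self, tokens):
        """Accumulate tokens without launching any work."""
        self.pending.extend(tokens)

    def ready_batches(self, flush=False):
        """Yield full batches; with flush=True, also yield the final partial batch."""
        while len(self.pending) >= self.batch_size or (flush and self.pending):
            n = min(self.batch_size, len(self.pending))
            yield [self.pending.popleft() for _ in range(n)]


# Usage: one batcher per module, each with its own batch size.
attention = ModuleBatcher(batch_size=4)
attention.add(range(10))
full = list(attention.ready_batches())            # two batches of 4 tokens
rest = list(attention.ready_batches(flush=True))  # the remaining 2 tokens
```

Keeping a separate queue per module is what lets the attention and expert modules run at different batch sizes, which is the property the abstract contrasts with model-based batching.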

7.9 | LG | Dec 10, 2024

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Yinsicheng Jiang, Yao Fu, Yeqi Huang et al.

The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce two sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

33.4 | LG | Jan 25, 2024 | Code

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Yao Fu, Leyang Xue, Yeqi Huang et al.

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10-200x across various LLM inference workloads.
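Contribution (iii) can be pictured with a small sketch that picks the server with the lowest estimated checkpoint load time. It is only an illustration under strong assumptions: the tier names, the bandwidth figures, and both function names are hypothetical, and the estimate ignores queueing and live migration, which the real scheduler may take into account.

```python
# Hypothetical per-tier load bandwidths in GB/s; illustrative values, not measurements.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


def estimated_startup_seconds(checkpoint_gb, tier):
    """Estimate load time as checkpoint size divided by the bandwidth of its current tier."""
    return checkpoint_gb / TIER_BANDWIDTH_GBPS[tier]


def pick_server(checkpoint_gb, locality):
    """Return the server with the smallest estimated startup time.

    `locality` maps a server name to the tier where the checkpoint currently resides.
    """
    return min(
        locality,
        key=lambda server: estimated_startup_seconds(checkpoint_gb, locality[server]),
    )


# Usage: the checkpoint is cached in DRAM on server "c", so "c" is chosen.
chosen = pick_server(10.0, {"a": "remote", "b": "ssd", "c": "dram"})
```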

2.3 | SOC-PH | Apr 1, 2019

Enhancing the long-term performance of recommender system

Leyang Xue, Peng Zhang, An Zeng

Recommender systems are critically important tools in online commercial systems, providing users with personalized item recommendations. So far, numerous recommendation algorithms have been proposed to improve performance in single-step recommendation, while long-term recommendation performance has been neglected. In this paper, we propose an approach called Adjustment of Recommendation List (ARL) to enhance long-term recommendation accuracy. To observe long-term accuracy, we develop a network evolution model that simulates the interaction between the recommender system and users' behaviour. The results show that not only can long-term recommendation accuracy be enhanced significantly, but the diversity of items in the online system also remains healthy. Notably, an optimal value n* of the ARL parameter exists in long-term recommendation, indicating a trade-off between preserving item diversity and matching user preferences when maximizing long-term recommendation accuracy. Finally, we confirm that the optimal parameter n* is stable as the network evolves, which demonstrates the robustness of the ARL method.

3.3 | SOC-PH | Mar 29, 2019

Predictability of diffusion-based recommender systems

Peng Zhang, Leyang Xue, An Zeng

Recommendation methods based on network diffusion have been shown to perform well in both recommendation accuracy and diversity. Numerous extensions have been made to further improve the performance of such methods. However, the extent to which items can be predicted by diffusion-based algorithms is still poorly understood. Here, we propose a method to quantify the predictability of diffusion-based algorithms and conduct experiments on the MovieLens and Netflix data sets. The results show that higher recommendation accuracy can still be achieved with diffusion algorithms by optimizing the way resources are allocated on a dense network. On a sparse network, the room for improving accuracy is relatively small because the current accuracy of diffusion-based methods is very close to their predictability. In this case, we find that predictability can be improved significantly by multi-step diffusion, especially for users with less historical information. Contrary to common belief, there are plausible circumstances in which the higher predictability of diffusion-based methods does not correspond to users with more historical records. We therefore propose the diffusion coverage and item average degree to explain this phenomenon. In addition, we demonstrate that recommendation accuracy in real online systems is overestimated by the random partition used in the literature, suggesting that recommendation in real online systems may be a harder task.
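For readers unfamiliar with diffusion-based recommendation, the following is a minimal sketch of standard two-step mass diffusion on a user-item bipartite network. It is a textbook baseline and not necessarily the exact variant studied in the paper; the function name and the toy data are assumptions made for illustration.

```python
def mass_diffusion_scores(user_items, target):
    """Score uncollected items for `target` by two-step resource spreading.

    `user_items` maps each user to the set of items that user has collected.
    """
    # Build the reverse index: item -> set of users who collected it.
    item_users = {}
    for user, items in user_items.items():
        for item in items:
            item_users.setdefault(item, set()).add(user)

    # Step 1: each item collected by the target holds one unit of resource
    # and splits it equally among all users who collected that item.
    user_resource = {}
    for item in user_items[target]:
        share = 1.0 / len(item_users[item])
        for user in item_users[item]:
            user_resource[user] = user_resource.get(user, 0.0) + share

    # Step 2: each user splits the received resource equally among their items.
    scores = {}
    for user, resource in user_resource.items():
        share = resource / len(user_items[user])
        for item in user_items[user]:
            scores[item] = scores.get(item, 0.0) + share

    # Only items the target has not collected are candidates for recommendation.
    return {i: s for i, s in scores.items() if i not in user_items[target]}


# Usage with a toy network of three users and four items.
toy = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"a", "c", "d"}}
ranking = sorted(mass_diffusion_scores(toy, "u1").items(), key=lambda kv: -kv[1])
```

Dividing by item degree in step 1 and by user degree in step 2 is the resource-allocation rule; changing those divisions is one way the allocation could be varied.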