Lingfan Yu

LG
h-index7
5papers
1,008citations
Novelty48%
AI Score40

5 Papers

DCJan 12
CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

Adrian Zhao, Zhenkun Cai, Zhenyu Song et al.

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.

LGDec 9, 2023
Stateful Large Language Model Serving with Pensieve

Lingfan Yu, Jinkun Lin, Jinyang Li

Large Language Models (LLMs) are wildly popular today and it is important to serve them efficiently. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated processing. In this paper, we design $Pensieve$, a system optimized for multi-turn conversation LLM serving. $Pensieve$ maintains the conversation state across requests by caching previously processed history to avoid duplicate processing. $Pensieve$'s multi-tier caching strategy can utilize both GPU and CPU memory to efficiently store and retrieve cached data. $Pensieve$ also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that $Pensieve$ can achieve $1.14$-$3.0\times$ the throughput of vLLM and TensorRT-LLM and significantly reduce latency.

LGNov 19, 2020
Scalable Graph Neural Networks for Heterogeneous Graphs

Lingfan Yu, Jiajun Shen, Jinyang Li et al.

Graph neural networks (GNNs) are a popular class of parametric model for learning over graph-structured data. Recent work has argued that GNNs primarily use the graph for feature smoothing, and have shown competitive results on benchmark tasks by simply operating on graph-smoothed node features, rather than using end-to-end learned feature hierarchies that are challenging to scale to large graphs. In this work, we ask whether these results can be extended to heterogeneous graphs, which encode multiple types of relationship between different entities. We propose Neighbor Averaging over Relation Subgraphs (NARS), which trains a classifier on neighbor-averaged features for randomly-sampled subgraphs of the "metagraph" of relations. We describe optimizations to allow these sets of node features to be computed in a memory-efficient way, both at training and inference time. NARS achieves a new state of the art accuracy on several benchmark datasets, outperforming more expensive GNN-based methods

LGSep 3, 2019
Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Minjie Wang, Da Zheng, Zihao Ye et al.

Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the computational patterns of GNNs into a few generalized sparse tensor operations suitable for extensive parallelization. By advocating graph as the central programming abstraction, DGL can perform optimizations transparently. By cautiously adopting a framework-neutral design, DGL allows users to easily port and leverage the existing components across multiple deep learning frameworks. Our evaluation shows that DGL significantly outperforms other popular GNN-oriented frameworks in both speed and memory consumption over a variety of benchmarks and has little overhead for small scale workloads.

CRSep 25, 2017
The Efficient Server Audit Problem, Deduplicated Re-execution, and the Web

Cheng Tan, Lingfan Yu, Joshua B. Leners et al.

You put a program on a concurrent server, but you don't trust the server; later, you get a trace of the actual requests that the server received from its clients and the responses that it delivered. You separately get logs from the server; these are untrusted. How can you use the logs to efficiently _verify_ that the responses were derived from running the program on the requests? This is the _Efficient Server Audit Problem_, and it abstracts real-world scenarios, including running a web application on an untrusted provider. We give a solution based on several new techniques, including simultaneous replay and efficient verification of concurrent executions. We implement the solution for PHP web applications. For several applications, our verifier achieves 5.6--10.9x speedup versus simply re-executing, with less than 10 percent overhead for the server.