Shipeng Li

CV
h-index8
13papers
332citations
Novelty50%
AI Score44

13 Papers

LGNov 3, 2025
Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

Chengying Huan, Ziheng Meng, Yongchao Liu et al.

Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM inference mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient adoption for complex real-world reasoning at scale.

LGOct 18, 2024
Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving

Zhibin Wang, Shipeng Li, Yuhang Zhou et al.

User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall system performance are two key performance measures. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs, both of which are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving, and propose a new SLO that aligns with user experience. Based on the SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Through this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance with different serving strategies.

DCMar 1, 2025
Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving

Zhibin Wang, Shipeng Li, Xue Li et al.

Large language models have been widely deployed in various applications, encompassing both interactive online tasks and batched offline tasks. Given the burstiness and latency sensitivity of online tasks, over-provisioning resources is common practice. This allows for the integration of latency-insensitive offline tasks during periods of low online load, enhancing resource utilization. However, strategically serving online and offline tasks through a preemption mechanism fails to fully leverage the flexibility of offline tasks and suffers from KV cache recomputation and irregular workloads. In this paper, we introduce Echo, a collaborative online-offline task serving system, including a scheduler, a KV cache manager, and estimation toolkits. The scheduler and KV cache manager work tightly to maximize the throughput of offline tasks, while the estimator further predicts execution time to ensure online task SLOs. The scheduler leverages the batch information of last iteration to reduce the search space for finding the optimal schedule. The KV cache manager sets the priority of the KV cache based on the type of tasks and the opportunity of prefix sharing to reduce the recomputation. Finally, the estimation toolkits predict the execution time, future memory consumption, and the throughput of offline tasks to guide the scheduler, KV cache manager, and the system deployer. Evaluation based on real-world workloads demonstrates that Echo can increase offline task throughput by up to $3.3\times$, while satisfying online task SLOs.

LGJun 13, 2025
LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment

Shipeng Li, Shikun Li, Zhiqin Yang et al.

Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects the learnable and representative training reasoning data for RL post-training. To overcome the issue of response-length bias in gradient norms, we introduce the data learnability based on the success rate, which can indicate the learning potential of each data point. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements while achieving minor performance degradation or even improving performance compared to full-data training. For example, it reduces data requirements by up to 1,000 data points with better performance (77.53%) than that on the full dataset on GSM8K benchmark (77.04%). Furthermore, we show its effectiveness in the staged RL setting. This work provides valuable insights into data-efficient RL post-training and establishes a foundation for future research in optimizing reasoning data selection. To facilitate future work, we will release code.

DCOct 15, 2025
Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference

Zhibin Wang, Zetao Hong, Xue Li et al.

Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with : A dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.

CVOct 21, 2015
Content adaptive screen image scaling

Yao Zhai, Qifei Wang, Yan Lu et al.

This paper proposes an efficient content adaptive screen image scaling scheme for the real-time screen applications like remote desktop and screen sharing. In the proposed screen scaling scheme, a screen content classification step is first introduced to classify the screen image into text and pictorial regions. Afterward, we propose an adaptive shift linear interpolation algorithm to predict the new pixel values with the shift offset adapted to the content type of each pixel. The shift offset for each screen content type is offline optimized by minimizing the theoretical interpolation error based on the training samples respectively. The proposed content adaptive screen image scaling scheme can achieve good visual quality and also keep the low complexity for real-time applications.

CVJan 5, 2015
Group $K$-Means

Jianfeng Wang, Shuicheng Yan, Yi Yang et al.

We study how to learn multiple dictionaries from a dataset, and approximate any data point by the sum of the codewords each chosen from the corresponding dictionary. Although theoretically low approximation errors can be achieved by the global solution, an effective solution has not been well studied in practice. To solve the problem, we propose a simple yet effective algorithm \textit{Group $K$-Means}. Specifically, we take each dictionary, or any two selected dictionaries, as a group of $K$-means cluster centers, and then deal with the approximation issue by minimizing the approximation errors. Besides, we propose a hierarchical initialization for such a non-convex problem. Experimental results well validate the effectiveness of the approach.

CVMay 16, 2014
Optimized Cartesian $K$-Means

Jianfeng Wang, Jingdong Wang, Jingkuan Song et al.

Product quantization-based approaches are effective to encode high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian $K$-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from different sub codebooks in each subspace, which are optimally generated with regards to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.

CVDec 11, 2013
Fast Neighborhood Graph Search using Cartesian Concatenation

Jingdong Wang, Jing Wang, Gang Zeng et al.

In this paper, we propose a new data structure for approximate nearest neighbor search. This structure augments the neighborhood graph with a bridge graph. We propose to exploit Cartesian concatenation to produce a large set of vectors, called bridge vectors, from several small sets of subvectors. Each bridge vector is connected with a few reference vectors near to it, forming a bridge graph. Our approach finds nearest neighbors by simultaneously traversing the neighborhood graph and the bridge graph in the best-first strategy. The success of our approach stems from two factors: the exact nearest neighbor search over a large number of bridge vectors can be done quickly, and the reference vectors connected to a bridge (reference) vector near the query are also likely to be near the query. Experimental results on searching over large scale datasets (SIFT, GIST and HOG) show that our approach outperforms state-of-the-art ANN search algorithms in terms of efficiency and accuracy. The combination of our approach with the IVFADC system also shows superior performance over the BIGANN dataset of $1$ billion SIFT features compared with the best previously published result.

CVDec 11, 2013
Fast Approximate $K$-Means via Cluster Closures

Jingdong Wang, Jing Wang, Qifa Ke et al.

$K$-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in multimedia and computer vision community. Traditional $k$-means is an iterative algorithm---in each iteration new cluster centers are computed and each data point is re-assigned to its nearest center. The cluster re-assignment step becomes prohibitively expensive when the number of data points and cluster centers are large. In this paper, we propose a novel approximate $k$-means algorithm to greatly reduce the computational complexity in the assignment step. Our approach is motivated by the observation that most active points changing their cluster assignments at each iteration are located on or near cluster boundaries. The idea is to efficiently identify those active points by pre-assembling the data into groups of neighboring points using multiple random spatial partition trees, and to use the neighborhood information to construct a closure for each cluster, in such a way only a small number of cluster candidates need to be considered when assigning a data point to its nearest cluster. Using complexity analysis, image data clustering, and applications to image retrieval, we show that our approach out-performs state-of-the-art approximate $k$-means algorithms in terms of clustering quality and efficiency.

CVJul 30, 2013
Scalable $k$-NN graph construction

Jingdong Wang, Jing Wang, Gang Zeng et al.

The $k$-NN graph has played a central role in increasingly popular data-driven techniques for various learning and vision tasks; yet, finding an efficient and effective way to construct $k$-NN graphs remains a challenge, especially for large-scale high-dimensional data. In this paper, we propose a new approach to construct approximate $k$-NN graphs with emphasis in: efficiency and accuracy. We hierarchically and randomly divide the data points into subsets and build an exact neighborhood graph over each subset, achieving a base approximate neighborhood graph; we then repeat this process for several times to generate multiple neighborhood graphs, which are combined to yield a more accurate approximate neighborhood graph. Furthermore, we propose a neighborhood propagation scheme to further enhance the accuracy. We show both theoretical and empirical accuracy and efficiency of our approach to $k$-NN graph construction and demonstrate significant speed-up in dealing with large scale visual data.

CVJul 30, 2013
Hybrid Affinity Propagation

Jingdong Wang, Hao Xu, Xian-Sheng Hua et al.

In this paper, we address a problem of managing tagged images with hybrid summarization. We formulate this problem as finding a few image exemplars to represent the image set semantically and visually, and solve it in a hybrid way by exploiting both visual and textual information associated with images. We propose a novel approach, called homogeneous and heterogeneous message propagation ($\text{H}^\text{2}\text{MP}$). Similar to the affinity propagation (AP) approach, $\text{H}^\text{2}\text{MP}$ reduce the conventional \emph{vector} message propagation to \emph{scalar} message propagation to make the algorithm more efficient. Beyond AP that can only handle homogeneous data, $\text{H}^\text{2}\text{MP}$ generalizes it to exploit extra heterogeneous relations and the generalization is non-trivial as the reduction to scalar messages from vector messages is more challenging. The main advantages of our approach lie in 1) that $\text{H}^\text{2}\text{MP}$ exploits visual similarity and in addition the useful information from the associated tags, including the associations relation between images and tags and the relations within tags, and 2) that the summary is both visually and semantically satisfactory. In addition, our approach can also present a textual summary to a tagged image collection, which can be used to automatically generate a textual description. The experimental results demonstrate the effectiveness and efficiency of the roposed approach.

IRJul 29, 2013
Image Tag Refinement by Regularized Latent Dirichlet Allocation

Jingdong Wang, Jiazhen Zhou, Hao Xu et al.

Tagging is nowadays the most prevalent and practical way to make images searchable. However, in reality many manually-assigned tags are irrelevant to image content and hence are not reliable for applications. A lot of recent efforts have been conducted to refine image tags. In this paper, we propose to do tag refinement from the angle of topic modeling and present a novel graphical model, regularized Latent Dirichlet Allocation (rLDA). In the proposed approach, tag similarity and tag relevance are jointly estimated in an iterative manner, so that they can benefit from each other, and the multi-wise relationships among tags are explored. Moreover, both the statistics of tags and visual affinities of images in the corpus are explored to help topic modeling. We also analyze the superiority of our approach from the deep structure perspective. The experiments on tag ranking and image retrieval demonstrate the advantages of the proposed method.