Zikun Li

CE
h-index: 11
Papers: 5
Citations: 57
Novelty: 56%
AI Score: 50

5 Papers

DC, May 5
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Yixuan Mei, Zikun Li, Zixuan Chen et al.

The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79× over the best baseline, and delivers up to 2.39× higher goodput under scarce resource availability.

QUANT-PH, Jul 17, 2023
Quarl: A Learning-Based Quantum Circuit Optimizer

Zikun Li, Jinjun Peng, Yixuan Mei et al.

Optimizing quantum circuits is challenging due to the very large search space of functionally equivalent circuits and the necessity of applying transformations that temporarily decrease performance to achieve a final performance improvement. This paper presents Quarl, a learning-based quantum circuit optimizer. Applying reinforcement learning (RL) to quantum circuit optimization raises two main challenges: the large and varying action space and the non-uniform state representation. Quarl addresses these issues with a novel neural architecture and RL-training procedure. Our neural architecture decomposes the action space into two parts and leverages graph neural networks in its state representation. Both choices rest on the intuition that optimization decisions can be guided mostly by local reasoning while still allowing global, circuit-wide reasoning. Our evaluation shows that Quarl significantly outperforms existing circuit optimizers on almost all benchmark circuits. Surprisingly, Quarl can learn to perform rotation merging, a complex, non-local circuit optimization implemented as a separate pass in existing optimizers.

CE, Mar 18
CICDWOA: A Collective Cognitive Sharing Whale Optimization Algorithm with Cauchy Inverse Cumulative Distribution for 2D/3D Path Planning and Engineering Design Problems

Junhao Wei, Yanxiao Li, Seyedali Mirjalili et al.

The Whale Optimization Algorithm (WOA) has shown strong optimization ability but still suffers from premature convergence and weak search diversity. To address these issues, this paper proposes an enhanced WOA variant called CICDWOA. The proposed algorithm introduces a Good Nodes Set (GNS) method for uniform population initialization, a Collective Cognitive Sharing (CCS) mechanism to enhance group collaboration, and an Enhanced Spiral Updating strategy based on the Cauchy Inverse Cumulative Distribution (CICD) to better balance global exploration and local exploitation. In addition, a nonlinear convergence factor and a Hybrid Gaussian-Cauchy mutation based on Differential Evolution (DE) further improve convergence efficiency and population diversity. CICDWOA was evaluated on 23 benchmark functions, 2D robot path planning problems, 3D UAV path planning tasks, and 10 engineering design problems. Statistical results show that CICDWOA achieves faster convergence, higher accuracy, and better robustness than classical WOA and other advanced metaheuristic algorithms. CICDWOA achieved an average Friedman value of 1.6790, ranking first among the state-of-the-art algorithms compared. The results of engineering simulations confirm that CICDWOA provides an effective and general framework for solving complex optimization and engineering problems. The code for CICDWOA is available at https://github.com/JunhaoWei-mpu/ROBIS-Lab/tree/CICDWOA.
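The Cauchy inverse cumulative distribution named in this abstract has a standard closed form, F^{-1}(p) = x0 + gamma * tan(pi * (p - 1/2)). The Python sketch below only illustrates that formula and how it turns uniform samples into heavy-tailed step sizes. The names `cauchy_icdf` and `sample_cauchy_steps` and the parameters `x0`, `gamma`, and `seed` are illustrative assumptions; how CICDWOA actually uses the distribution inside its spiral update is not stated in the abstract.

```python
import math
import random


def cauchy_icdf(p, x0=0.0, gamma=1.0):
    """Inverse CDF (quantile function) of the Cauchy distribution.

    p must lie strictly between 0 and 1. x0 is the location and gamma
    the scale; both defaults are hypothetical, chosen for illustration.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in the open interval (0, 1)")
    return x0 + gamma * math.tan(math.pi * (p - 0.5))


def sample_cauchy_steps(n, seed=0):
    """Draw n heavy-tailed step sizes by inverse-transform sampling."""
    rng = random.Random(seed)
    steps = []
    while len(steps) < n:
        p = rng.random()  # uniform in [0, 1)
        if p > 0.0:       # skip the excluded endpoint p == 0
            steps.append(cauchy_icdf(p))
    return steps
```

Because the Cauchy distribution has heavy tails, most sampled steps are small while a few are very large, which is the usual motivation for using it to mix local refinement with occasional long jumps.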

RO, May 19
KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

Dexing Yao, Haochen Li, Junhao Wei et al.

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism, comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space, to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.
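The worst-case safety margin reported above is described as the minimum distance to obstacles. A minimal Python sketch of such a metric follows, assuming that both the flown trajectory and the obstacles are available as sampled 3D points; that representation, and the function name `min_obstacle_distance`, are assumptions for illustration, and the paper's evaluation may compute the margin differently.

```python
import math


def min_obstacle_distance(trajectory, obstacles):
    """Smallest Euclidean distance between any trajectory sample and
    any obstacle point. Both arguments are lists of (x, y, z) tuples."""
    if not trajectory or not obstacles:
        raise ValueError("trajectory and obstacles must be non-empty")
    return min(math.dist(p, o) for p in trajectory for o in obstacles)
```

A larger value means the closest approach to any obstacle was farther away, so an increase from 0.48 m to 0.76 m corresponds to a wider clearance at the tightest point of the flight.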

CL, Jan 21, 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding

Zikun Li, Zhuofu Chen, Remi Delacourt et al.

Modern large language model (LLM) applications exhibit diverse service-level objectives (SLOs), from low-latency requirements in interactive coding assistants to more relaxed constraints in data wrangling tasks. Existing LLM serving systems, which rely on uniform batching and scheduling strategies, often fail to meet these heterogeneous SLOs concurrently. We present AdaServe, the first LLM serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm that constructs a speculation tree tailored to each request's latency target. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput. AdaServe further adapts to workload variation by dynamically adjusting speculation parameters. Evaluations across diverse workloads show that AdaServe reduces SLO violations by up to 4.3× and improves goodput by up to 1.9× compared to the best performing baselines, highlighting its effectiveness in multi-SLO serving.
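For readers unfamiliar with the verify step in speculative decoding, the Python sketch below shows the generic greedy variant on a single token sequence: draft tokens are accepted for as long as the target model would have produced the same token, and the first mismatch is replaced by the target's choice. This is not AdaServe's algorithm, which builds per-request speculation trees and adds a selection stage. The function `verify_draft` and the callback `target_next_token` are hypothetical stand-ins, and a real system would score all draft positions in one batched pass instead of one call per token.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Greedy verification of a linear draft.

    prefix: tokens generated so far.
    draft_tokens: tokens proposed by a small draft model.
    target_next_token: hypothetical callable mapping a token list to
        the large model's greedy next token.
    Returns the tokens to append: the agreed draft prefix plus exactly
    one token chosen by the target (a correction or a bonus token).
    """
    seq = list(prefix)
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
        seq.append(tok)
    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next_token(seq))
    return accepted
```

Each verification round therefore yields at least one token and at most `len(draft_tokens) + 1`, which is why the number of tokens speculated per request gives a handle on its decoding speed.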