Yaodan Xu

DC
h-index5
3papers
16citations
Novelty47%
AI Score36

3 Papers

LGJan 30, 2023
SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms

Yaodan Xu, Jingzhou Sun, Sheng Zhou et al.

In up-to-date machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on the platforms, such as graphics processing units (GPUs), have higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also result in longer response time, and thus it requires a judicious design. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size dependent processing time. Then, the design of dynamic batching is a continuous-time average-cost problem, and is formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time and average power consumption. The optimal policy is acquired by solving an associated discrete-time Markov decision process (MDP) problem with finite state approximation and "discretization". By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Our results show that the optimal policies potentially possess a control limit structure. Numerical results also show that SMDP-based batching policies can adapt to different traffic intensities and outperform other benchmark policies. Furthermore, the proposed solution has notable flexibility in balancing power consumption and latency.

ITApr 22
DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

Yaodan Xu, Sheng Zhou, Zhisheng Niu

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft tokens are generated locally on devices and subsequently offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under coupled decisions of (i) batching and pipeline scheduling and (ii) per user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, user-to-batch assignment, and integer draft lengths. To solve the resulting fractional mixed-integer program, DiP-SD scans the batch number and iteratively alternates between an association subproblem and a draft-length subproblem. Numerical results under a Qwen3-1.7B/Qwen3-32B device-edge deployment show that DiP-SD achieves up to 17.89x throughput over autoregressive decoding (AD) and 1.93x over AD with greedy batching.

DCJan 4, 2025
SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services

Yaodan Xu, Sheng Zhou, Zhisheng Niu

For servers incorporating parallel computing resources, batching is a pivotal technique for providing efficient and economical services at scale. Parallel computing resources exhibit heightened computational and energy efficiency when operating with larger batch sizes. However, in the realm of online services, the adoption of a larger batch size may lead to longer response times. This paper aims to provide a dynamic batching scheme that delicately balances latency and efficiency. The system is modeled as a batch service queue with size-dependent service times. Then, the design of dynamic batching is formulated as a semi-Markov decision process (SMDP) problem, with the objective of minimizing the weighted sum of average response time and average power consumption. A method is proposed to derive an approximate optimal SMDP solution, representing the chosen dynamic batching policy. By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Numerical results showcase the superiority of SMDP-based batching policies across various parameter setups. Additionally, the proposed scheme exhibits noteworthy flexibility in balancing power consumption and latency.