TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
This work addresses the need for speculative decoding in LLM serving systems that improves efficiency robustly, without expert tuning. The contribution is incremental in that it builds on existing speculation methods.
The paper tackles the problem that speculative decoding can degrade LLM serving performance because of its overhead and missed speculated tokens. It presents TurboSpec, a closed-loop control system that dynamically adjusts intra-request parallelism to optimize goodput, and it reports consistent performance improvements across diverse workloads and hardware configurations.
Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent work focuses on intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, the benefits of intra-request parallelism are often fragile, because speculative decoding adds overhead and speculated tokens may miss. We observe that speculative decoding may degrade LLM serving performance if it is added naively, without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and uses a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts "goodput" (the amount of successfully generated tokens) to evaluate candidate settings and, at runtime, sets the amount of intra-request parallelism to the one with the highest predicted goodput. We implement TurboSpec on vLLM, a real-world LLM serving system, and demonstrate its effectiveness across diverse workloads and hardware configurations, where it provides consistent performance improvements across all test scenarios.
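To make the idea of goodput-driven control concrete, the following is a minimal sketch and not TurboSpec's actual implementation. It assumes a hypothetical linear latency model whose coefficients would come from profiling, an independent per-token acceptance rate `alpha`, and the standard geometric estimate of accepted tokens per step. All names (`LatencyModel`, `choose_k`, `update_alpha`) and all numeric values are illustrative assumptions, and goodput is expressed here as tokens per millisecond.

```python
from dataclasses import dataclass


@dataclass
class LatencyModel:
    """Hypothetical linear cost model; coefficients would come from profiling."""
    base_ms: float       # fixed cost of one target-model forward pass
    per_token_ms: float  # added cost per token verified in the batch
    draft_ms: float      # cost of proposing one speculative token

    def step_ms(self, batch_size: int, k: int) -> float:
        # Each request verifies k speculated tokens plus one regular token.
        verified = batch_size * (k + 1)
        return self.base_ms + self.per_token_ms * verified + self.draft_ms * k


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per request per step.

    Assumes each speculated token is accepted independently with
    probability alpha (a simplifying assumption, not from the paper).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def predicted_goodput(model: LatencyModel, batch_size: int,
                      alpha: float, k: int) -> float:
    """Predicted successfully generated tokens per millisecond."""
    return batch_size * expected_tokens(alpha, k) / model.step_ms(batch_size, k)


def choose_k(model: LatencyModel, batch_size: int,
             alpha: float, max_k: int = 8) -> int:
    """Pick the speculation length with the highest predicted goodput."""
    return max(range(max_k + 1),
               key=lambda k: predicted_goodput(model, batch_size, alpha, k))


def update_alpha(alpha: float, accepted: int, proposed: int,
                 weight: float = 0.2) -> float:
    """Feedback step: blend the observed acceptance rate into the estimate."""
    if proposed == 0:
        return alpha
    return (1.0 - weight) * alpha + weight * (accepted / proposed)


if __name__ == "__main__":
    model = LatencyModel(base_ms=20.0, per_token_ms=0.05, draft_ms=1.0)
    for batch_size, alpha in [(1, 0.8), (256, 0.2)]:
        k = choose_k(model, batch_size, alpha)
        print(f"batch={batch_size} alpha={alpha} -> speculate {k} tokens")
```

Under these assumed coefficients, a single request with a high acceptance rate favors long speculation, while a large batch with a low acceptance rate turns speculation off (`k = 0`). This mirrors the observation that naive speculation can hurt when batching already supplies enough parallelism.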