96.5DCMay 4Code
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM InferenceJincheng Xie, Yawen Ling, Qi Xiao et al.
LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose \textbf{SPECTRE} (Parallel \textbf{SPEC}ulative Decoding with a Multi-\textbf{T}enant \textbf{RE}mote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft--target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in \texttt{SGLang} and evaluate it across multiple draft--target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to \textbf{2.28$\times$ speedup} over autoregressive decoding and up to an additional \textbf{66\% relative improvement} over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272.
86.9SYApr 22
Accurate Frequency Response Modeling in Integrated T&D Co-Simulation via EWMA-RTTA-Based Quadratic ExtrapolationJong Ha Woo, Qi Xiao, Yu Ma et al.
The large-scale integration of inverter-based resources (IBRs), particularly distributed photovoltaics (DPVs), into distribution networks increases the need for integrated transmission and distribution (T&D) co-simulation. A key challenge in such co-simulation lies in accurately modeling system frequency across two asynchronous simulation environments. For example, the transmission system, simulated in the phasor domain, can operate with a simulation timestep of 10 ms, while the distribution system, simulated in the electromagnetic transient domain (EMT) to include IBR models, uses a much finer timestep of 100 microseconds. To ensure accurate PLL-based frequency estimation in distribution systems, it is essential to predict voltage magnitude and phase angle variations within the 10 ms transmission intervals, rather than using constant values that cause inaccurate frequency calculations. This issue becomes particularly critical when modeling primary and secondary frequency response services provided by IBRs. To address this challenge, we propose an automated Exponentially Weighted Moving Average Real-Time Threshold Adaptation (EWMA-RTTA) method, which utilizes Quadratic Extrapolation to predict voltage magnitude and phase angle trends more precisely. The proposed method is validated using two Opal-RT simulators: one simulating an IEEE 118-bus transmission system and the other simulating an IEEE 123-bus distribution network. Simulation results demonstrate that our approach improves the normalized mean absolute error (nMAE) by a factor of 25.7 compared to methods that do not account for time mismatches, offering a scalable and accurate solution for modeling IBR-based frequency response in modern power systems.
LGJun 8, 2020
Supervised Whole DAG Causal DiscoveryHebi Li, Qi Xiao, Jin Tian
We propose to address the task of causal structure learning from data in a supervised manner. Existing work on learning causal directions by supervised learning is restricted to learning pairwise relation, and not well suited for whole DAG discovery. We propose a novel approach of modeling the whole DAG structure discovery as a supervised learning. To fit the problem in hand, we propose to use permutation equivariant models that align well with the problem domain. We evaluate the proposed approach extensively on synthetic graphs of size 10,20,50,100 and real data, and show promising results compared with a variety of previous approaches.
LGMay 26, 2019
Purifying Adversarial Perturbation with Adversarially Trained Auto-encodersHebi Li, Qi Xiao, Shixin Tian et al.
Machine learning models are vulnerable to adversarial examples. Iterative adversarial training has shown promising results against strong white-box attacks. However, adversarial training is very expensive, and every time a model needs to be protected, such expensive training scheme needs to be performed. In this paper, we propose to apply iterative adversarial training scheme to an external auto-encoder, which once trained can be used to protect other models directly. We empirically show that our model outperforms other purifying-based methods against white-box attacks, and transfers well to directly protect other base models with different architectures.