23.9CLJun 4
AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative DecodingRunheng Liu, Jincheng Xie, Wen Hu et al.
Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose \emph{AdaPLD}, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
27.2DCMay 4Code
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM InferenceJincheng Xie, Yawen Ling, Qi Xiao et al.
LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose \textbf{SPECTRE} (Parallel \textbf{SPEC}ulative Decoding with a Multi-\textbf{T}enant \textbf{RE}mote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft--target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in \texttt{SGLang} and evaluate it across multiple draft--target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to \textbf{2.28$\times$ speedup} over autoregressive decoding and up to an additional \textbf{66\% relative improvement} over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272.
8.0AIApr 18
EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding ModelsJincheng Xie, Xingchen Xiao, Heyan Huang et al.
Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
20.4CLApr 21
MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented GenerationXingchen Xiao, Heyan Huang, Runheng Liu et al.
Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose \textbf{MASS-RAG}, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.