95.8LGMay 19
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative DecodingYuhao Shen, Tianyu Liu, Xinyi Hu et al.
Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential `prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41$\times$ speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
99.8DCMar 10
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency ScenariosXinyi Hu, Yuhao Shen, Baolin Zhang et al.
Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales-particularly the industrial-grade Qwen3-235B-demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.
CROct 8, 2025
EMPalm: Exfiltrating Palm Biometric Data via Electromagnetic Side-ChannelsHaowen Xu, Tianya Zhao, Xuyu Wang et al.
Palm recognition has emerged as a dominant biometric authentication technology in critical infrastructure. These systems operate in either single-modal form, using palmprint or palmvein individually, or dual-modal form, fusing the two modalities. Despite this diversity, they share similar hardware architectures that inadvertently emit electromagnetic (EM) signals during operation. Our research reveals that these EM emissions leak palm biometric information, motivating us to develop EMPalm--an attack framework that covertly recovers both palmprint and palmvein images from eavesdropped EM signals. Specifically, we first separate the interleaved transmissions of the two modalities, identify and combine their informative frequency bands, and reconstruct the images. To further enhance fidelity, we employ a diffusion model to restore fine-grained biometric features unique to each domain. Evaluations on seven prototype and two commercial palm acquisition devices show that EMPalm can recover palm biometric information with high visual fidelity, achieving SSIM scores up to 0.79, PSNR up to 29.88 dB, and FID scores as low as 6.82 across all tested devices, metrics that collectively demonstrate strong structural similarity, high signal quality, and low perceptual discrepancy. To assess the practical implications of the attack, we further evaluate it against four state-of-the-art palm recognition models, achieving a model-wise average spoofing success rate of 65.30% over 6,000 samples from 100 distinct users.
CRFeb 3
Generalizable and Interpretable RF Fingerprinting with Shapelet-Enhanced Large Language ModelsTianya Zhao, Junqing Zhang, Haowen Xu et al.
Deep neural networks (DNNs) have achieved remarkable success in radio frequency (RF) fingerprinting for wireless device authentication. However, their practical deployment faces two major limitations: domain shift, where models trained in one environment struggle to generalize to others, and the black-box nature of DNNs, which limits interpretability. To address these issues, we propose a novel framework that integrates a group of variable-length two-dimensional (2D) shapelets with a pre-trained large language model (LLM) to achieve efficient, interpretable, and generalizable RF fingerprinting. The 2D shapelets explicitly capture diverse local temporal patterns across the in-phase and quadrature (I/Q) components, providing compact and interpretable representations. Complementarily, the pre-trained LLM captures more long-range dependencies and global contextual information, enabling strong generalization with minimal training overhead. Moreover, our framework also supports prototype generation for few-shot inference, enhancing cross-domain performance without additional retraining. To evaluate the effectiveness of our proposed method, we conduct extensive experiments on six datasets across various protocols and domains. The results show that our method achieves superior standard and few-shot performance across both source and unseen domains.
NAJul 31, 2018
Lattice Approximations of Semilinear Stochastic Elliptic Equations with ReflectionJun Dai, Jing Zhang
We study lattice approximations of reflected stochastic elliptic equations driven by white noise on a bounded domain in $\mathbb{R}^d,\ d=1,2,3$. The convergence of the scheme is established.
73.8CVMay 8
AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred ScenesJun Dai, Renbiao Jin, Bo Xu et al.
3D reconstruction methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) achieve impressive photorealism but fail when input images suffer from severe motion blur. While event cameras provide high-temporal-resolution motion cues, existing event-assisted approaches rely on low-resolution sensors and strict synchronization, limiting their practicality for handheld 3D capture on common devices, such as smartphones. We introduce a flexible, high-resolution asynchronous RGB-Event dual-camera system and a corresponding reconstruction framework. Our approach first reconstructs sharp images from the event data and then employs a cross-domain pose estimation module based on the Visual Geometry Transformer (VGGT) to obtain robust initialization for 3DGS. During optimization, we employ a structure-driven event loss and view-specific consistency regularizers to mitigate the ill-posed behavior of traditional event losses and deblurring losses, ensuring both stable and high-fidelity reconstruction. We further contribute AsyncEv-Deblur, a new high-resolution RGB-Event dataset captured with our asynchronous system. Experiments demonstrate that our method achieves state-of-the-art performance on both our challenging dataset and existing benchmarks, substantially improving reconstruction robustness under severe motion blur. Project page: https://openimaginglab.github.io/AsyncEvGS/
QUANT-PHOct 30, 2025
FlowQ-Net: A Generative Framework for Automated Quantum Circuit DesignJun Dai, Michael Rizvi-Martel, Guillaume Rabusseau
Designing efficient quantum circuits is a central bottleneck to exploring the potential of quantum computing, particularly for noisy intermediate-scale quantum (NISQ) devices, where circuit efficiency and resilience to errors are paramount. The search space of gate sequences grows combinatorially, and handcrafted templates often waste scarce qubit and depth budgets. We introduce \textsc{FlowQ-Net} (Flow-based Quantum design Network), a generative framework for automated quantum circuit synthesis based on Generative Flow Networks (GFlowNets). This framework learns a stochastic policy to construct circuits sequentially, sampling them in proportion to a flexible, user-defined reward function that can encode multiple design objectives such as performance, depth, and gate count. This approach uniquely enables the generation of a diverse ensemble of high-quality circuits, moving beyond single-solution optimization. We demonstrate the efficacy of \textsc{FlowQ-Net} through an extensive set of simulations. We apply our method to Variational Quantum Algorithm (VQA) ansatz design for molecular ground state estimation, Max-Cut, and image classification, key challenges in near-term quantum computing. Circuits designed by \textsc{FlowQ-Net} achieve significant improvements, yielding circuits that are 10$\times$-30$\times$ more compact in terms of parameters, gates, and depth compared to commonly used unitary baselines, without compromising accuracy. This trend holds even when subjected to error profiles from real-world quantum devices. Our results underline the potential of generative models as a general-purpose methodology for automated quantum circuit design, offering a promising path towards more efficient quantum algorithms and accelerating scientific discovery in the quantum domain.
71.1CLApr 29
When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?Tianyu Liu, Yuhao Shen, Xinyi Hu et al.
Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
87.5NIApr 27
MatchRDMA: A Segmented and Rate-Matched Long-Haul RDMA Scheme for Geo-distributed LLM Training over OTNJun Dai, Xiaorun Wang, Xingde Li et al.
We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it improves inter-DC throughput by up to 20x compared with conventional RDMA, and reduces destination-OTN buffer occupancy by up to 62.7%.
CRNov 28, 2024
RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation AnalysisXue Tan, Hao Luan, Mingyu Luo et al.
Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from the relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%.
CVFeb 7, 2025
Tolerance-Aware Deep OpticsJun Dai, Liqun Chen, Xinge Yang et al.
Deep optics has emerged as a promising approach by co-designing optical elements with deep learning algorithms. However, current research typically overlooks the analysis and optimization of manufacturing and assembly tolerances. This oversight creates a significant performance gap between designed and fabricated optical systems. To address this challenge, we present the first end-to-end tolerance-aware optimization framework that incorporates multiple tolerance types into the deep optics design pipeline. Our method combines physics-informed modelling with data-driven training to enhance optical design by accounting for and compensating for structural deviations in manufacturing and assembly. We validate our approach through computational imaging applications, demonstrating results in both simulations and real-world experiments. We further examine how our proposed solution improves the robustness of optical systems and vision algorithms against tolerances through qualitative and quantitative analyses. Code and additional visual results are available at openimaginglab.github.io/LensTolerance.
QUANT-PHOct 21, 2024
GFlowNets for Hamiltonian decomposition in groups of compatible operatorsIsaac L. Huidobro-Meezs, Jun Dai, Guillaume Rabusseau et al.
Quantum computing presents a promising alternative for the direct simulation of quantum systems with the potential to explore chemical problems beyond the capabilities of classical methods. However, current quantum algorithms are constrained by hardware limitations and the increased number of measurements required to achieve chemical accuracy. To address the measurement challenge, techniques for grouping commuting and anti-commuting terms, driven by heuristics, have been developed to reduce the number of measurements needed in quantum algorithms on near-term quantum devices. In this work, we propose a probabilistic framework using GFlowNets to group fully (FC) or qubit-wise commuting (QWC) terms within a given Hamiltonian. The significance of this approach is demonstrated by the reduced number of measurements for the found groupings; 51% and 67% reduction factors respectively for FC and QWC partitionings with respect to greedy coloring algorithms, highlighting the potential of GFlowNets for future applications in the measurement problem. Furthermore, the flexibility of our algorithm extends its applicability to other resource optimization problems in Hamiltonian simulation, such as circuit design.
CVFeb 17, 2025
A Physics-Informed Blur Learning Framework for Imaging SystemsLiqun Chen, Yuxuan Li, Jun Dai et al.
Accurate blur estimation is essential for high-performance imaging across various applications. Blur is typically represented by the point spread function (PSF). In this paper, we propose a physics-informed PSF learning framework for imaging systems, consisting of a simple calibration followed by a learning process. Our framework could achieve both high accuracy and universal applicability. Inspired by the Seidel PSF model for representing spatially varying PSF, we identify its limitations in optimization and introduce a novel wavefront-based PSF model accompanied by an optimization strategy, both reducing optimization complexity and improving estimation accuracy. Moreover, our wavefront-based PSF model is independent of lens parameters, eliminate the need for prior knowledge of the lens. To validate our approach, we compare it with recent PSF estimation methods (Degradation Transfer and Fast Two-step) through a deblurring task, where all the estimated PSFs are used to train state-of-the-art deblurring algorithms. Our approach demonstrates improvements in image quality in simulation and also showcases noticeable visual quality improvements on real captured images.