Daliang Xu

CR
h-index49
12papers
299citations
Novelty50%
AI Score55

12 Papers

NISep 8, 2023
LLMCad: Fast and Scalable On-device Large Language Model Inference

Daliang Xu, Wangsong Yin, Xin Jin et al.

Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a growing demand for their execution directly on mobile devices. Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs). Nevertheless, the limited memory capacity of these devices presents a formidable challenge to the scalability of such models. In our research, we introduce LLMCad, an innovative on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks. The core idea behind LLMCad revolves around model collaboration: a compact LLM, residing in memory, takes charge of generating the most straightforward tokens, while a high-precision LLM steps in to validate these tokens and rectify any identified errors. LLMCad incorporates three novel techniques: (1) Instead of generating candidate tokens in a sequential manner, LLMCad employs the smaller LLM to construct a token tree, encompassing a wider range of plausible token pathways. Subsequently, the larger LLM can efficiently validate all of these pathways simultaneously. (2) It employs a self-adjusting fallback strategy, swiftly initiating the verification process whenever the smaller LLM generates an erroneous token. (3) To ensure a continuous flow of token generation, LLMCad speculatively generates tokens during the verification process by implementing a compute-IO pipeline. Through an extensive series of experiments, LLMCad showcases an impressive token generation speed, achieving rates up to 9.3x faster than existing inference engines.

AIJul 8, 2024
Fast On-device LLM Inference with NPUs

Daliang Xu, Hao Zhang, Liming Yang et al.

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu enhances NPU offloading efficiency by re-constructing the prompt and model in three levels: (1) At prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) At tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) At block level, it schedules Transformer blocks in an out-of-order manner to the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill speed and 30.7$\times$ energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.

DCSep 8, 2024
Elastic On-Device LLM Service

Wangsong Yin, Rongjie Yi, Daliang Xu et al.

On-device Large Language Models (LLMs) are transforming mobile AI, catalyzing applications like UI automation without privacy concerns. Nowadays the common practice is to deploy a single yet powerful LLM as a general task solver for multiple requests. We identify a key system challenge in this paradigm: current LLMs lack the elasticity to serve requests that have diversified Service-Level Objectives (SLOs) on inference latency. To tackle this, we present \sys, an on-device LLM service that elasticizes both the model and the prompt dimension of a full LLM. It incorporates (1) a one-shot neuron-reordering method, which leverages the intrinsic permutation consistency in transformer models to generate high-quality elasticized sub-models with minimal runtime switching overhead; (2) a dual-head tiny language model, which efficiently and effectively refines the prompt and orchestrates the elastification between model and prompt. We implement such an elastic on-device LLM service on multiple COTS smartphones, and evaluate \sys on both standalone NLP/mobile-agent datasets and end-to-end synthesized traces. On diverse SLOs, \sys outperforms 7 strong baselines in (absolute) accuracy by up to 14.83\% and 10.45\% on average, with <1\% TTFT switching overhead, on-par memory consumption and <100 offline GPU hours.

LGMay 19
Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Jinghe Zhang, Daliang Xu, Chenghua Wang et al.

Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, a integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without runtime quantization parameters re-computation. Crucially, we identify that initialization and selective optimization of quantization parameters is pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.

PFAug 22, 2025
Dynamic Sparse Attention on Mobile SoCs

Wangsong Yin, Daliang Xu, Mengwei Xu et al.

On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.

CLApr 8
MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

Zhiyang Chen, Daliang Xu, Yinyuan Zhang et al.

Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.

LGJun 5, 2025
MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs

Zhenyan Lu, Daliang Xu, Dongqi Cai et al. · cambridge

Large language models (LLMs) are deployed on mobile devices to power killer applications such as intelligent assistants. LLMs pre-trained on general corpora often hallucinate when handling personalized or unseen queries, leading to incorrect or outdated responses. Knowledge editing addresses this by identifying and adjusting a small crucial portion of model weights, without compromising the general knowledge. However, prior knowledge editing methods are impractical to run on local devices due to the resource-heavy backpropagation (BP) needed for updates. We present MobiEdit, the first mobile knowledge editing framework that enables efficient LLM personalization on commercial off-the-shelf (COTS) mobile devices. MobiEdit replaces full-precision BP with quantized forward-only gradient estimation, thus compatible with the energy-efficient mobile neural processing units (NPUs). MobiEdit replaces full-precision backpropagation with quantized forward-only gradient estimation, making it compatible with energy-efficient mobile NPUs. To further improve gradient estimation efficiency, we introduce two optimizations: an early stoping mechanism that adaptively terminates editing upon success and a prefix cache that reuses computation across steps. Our approach enables real-time editing of a 3B-parameter model (Qwen2.5-3B-Instruct) on COTS mobile devices with 7.6$\times$ less memory, 14.7 $\times$ less energy and 3.6$\times$ less latency compared to previous knowledge editing methods.

CLOct 17, 2025
Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Zhiyang Chen, Daliang Xu, Haiyang Shen et al.

Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents sd.npu, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.

LGJan 16, 2024
A Survey of Resource-efficient LLM and Multimodal Foundation Models

Mengwei Xu, Wangsong Yin, Dongqi Cai et al.

Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.

CRApr 20, 2020
S3Library: Automatically Eliminating C/C++ Buffer Overflow using Compatible Safer Libraries

Kang Sun, Daliang Xu, Dongwei Chen et al.

Annex K of C11, bounds-checking interfaces, recently introduced a set of alternative functions to mitigate buffer overflows, primarily those caused by string/memory functions. However, poor compatibility limits their adoption. Failure oblivious computing can eliminate the possibility that an attacker can exploit memory errors to corrupt the address space and significantly increase the availability of systems. In this paper, we present S3Library (Saturation-Memory-Access Safer String Library), which is compatible with the standard C library in terms of function signature. Our technique automatically replaces unsafe deprecated memory/string functions with safer versions that perform bounds checking and eliminate buffer overflows via boundless memory. S3Library employs MinFat, a very compact pointer representation following the Less is More principle, to encode metadata into unused upper bits within pointers. In addition, S3Library utilizes Saturation Memory Access to eliminate illegal memory accesses into boundless padding area. Even if an out-of-bounds access is made, the fault program will not be interrupted. We implement our scheme within the LLVM framework on X86-64 and evaluate our approach on correctness, security, runtime performance and availability.

CRFeb 29, 2020
DangKiller: Eliminating Dangling Pointers Efficiently via Implicit Identifier

Daliang Xu, Dongwei Chen, Chun Yang et al.

Use-After-Free vulnerabilities, allowing the attacker to access unintended memory via dangling pointers, are more threatening. However, most detection schemes can only detect dangling pointers and invalid them, but not provide a tolerance mechanism to repair the errors at runtime. Also, these techniques obtain and manage the metadata inefficiently with complex structures and too much scan (sweep). The goal of this paper is to use compiler instrumentation to eliminate dangling pointers automatically and efficiently. In this paper, we observe that most techniques lack accurate efficient pointer graph metadata maintaining methods, so they need to scan the log to reduce the redundancy and sweep the whole address space to find dangling pointers. Also, they lack a direct, efficiently obtaining metadata approach. The key insight of this paper is that a unique identifier can be used as a key to a hash or direct-map algorithm. Thus, this paper maintains the same implicit identifier with each memory object and its corresponding referent. Associating the unique ID with metadata for memory objects, obtaining and managing the pointer graph metadata can be efficiently. Therefore, with the delayed free technique adopted into C/C++, we present the DangKiller as a novel and lightweight dangling pointer elimination solution. We first demonstrate the MinFat Pointer, which can calculate unique implicit ID for each object and pointer quickly, and use hash algorithm to obtain metadata. Secondly, we propose the Log Cache and Log Compression mechanism based on the ID to decrease the redundancy of dangling pointer candidates. Coupled with the Address Tagging architecture on an ARM64 system, our experiments show that the DangKiller can eliminate use-after-free vulnerabilities at only 11% and 3% runtime overheads for the SPEC CPU2006 and 2017 benchmarks respectively, except for unique cases.

CRFeb 7, 2020
Saturation Memory Access: Mitigating Memory Spatial Errors without Terminating Programs

Dongwei Chen, Daliang Xu, Dong Tong et al.

Memory spatial errors, i.e., buffer overflow vulnerabilities, have been a well-known issue in computer security for a long time and remain one of the root causes of exploitable vulnerabilities. Most of the existing mitigation tools adopt a fail-stop strategy to protect programs from intrusions, which means the victim program will be terminated upon detecting a memory safety violation. Unfortunately, the fail-stop strategy harms the availability of software. In this paper, we propose Saturation Memory Access (SMA), a memory spatial error mitigation mechanism that prevents out-of-bounds access without terminating a program. SMA is based on a key observation that developers generally do not rely on out-of-bounds accesses to implement program logic. SMA modifies dynamic memory allocators and adds paddings to objects to form an enlarged object boundary. By dynamically correcting all the out-of-bounds accesses to operate on the enlarged protecting boundaries, SMA can tolerate out-of-bounds accesses. For the sake of compatibility, we chose tagged pointers to record the boundary metadata of a memory object in the pointer itself, and correct the address upon detecting out-of-bounds access. We have implemented the prototype of SMA on LLVM 10.0. Our results show that our compiler enables the programs to execute successfully through buffer overflow attacks. Experiments on MiBench show that our prototype incurs an overhead of 78\%. Further optimizations would require ISA supports.