CR | AI | Sep 11, 2025

Towards Confidential and Efficient LLM Inference with Dual Privacy Protection

arXiv:2509.09091v1 | 2 citations | h-index: 10 | DASFAA
Originality: Incremental advance
AI Analysis

The paper addresses privacy concerns in LLM inference for users and organizations, but the advance is incremental because it builds on existing TEE and DP approaches.

The paper tackles high inference latency and communication overhead in privacy-preserving LLM inference. It proposes CMIF, which reduces the additional inference overhead of TEEs while preserving user data privacy, as demonstrated on Llama-series models.

CPU-based trusted execution environments (TEEs) and differential privacy (DP) have gained wide applications for private inference. Due to high inference latency in TEEs, researchers use partition-based approaches that offload linear model components to GPUs. However, dense nonlinear layers of large language models (LLMs) result in significant communication overhead between TEEs and GPUs. DP-based approaches apply random noise to protect data privacy, but this compromises LLM performance and semantic understanding. To overcome the above drawbacks, this paper proposes CMIF, a Confidential and efficient Model Inference Framework. CMIF confidentially deploys the embedding layer in the client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes the Report-Noisy-Max mechanism to protect sensitive inputs with a slight decrease in model performance. Extensive experiments on Llama-series models demonstrate that CMIF reduces additional inference overhead in TEEs while preserving user data privacy.
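
The abstract names the Report-Noisy-Max mechanism without describing it. As a rough illustration only, the textbook form of the mechanism can be sketched as below: Laplace noise is added to each candidate's score and only the index of the largest noisy score is released. This is not CMIF's optimized variant, which the abstract does not specify. The function names, the `sensitivity` parameter, the conservative noise scale of 2 * sensitivity / epsilon, and the idea of scoring candidate tokens by similarity are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def report_noisy_max(scores, epsilon, sensitivity=1.0, rng=None):
    """Return the index of the largest score after adding Laplace noise.

    Textbook Report-Noisy-Max, NOT the optimized variant from the paper.
    `sensitivity` is the assumed maximum change of any single score when
    one input record changes (hypothetical parameter for this sketch).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = rng or random.Random()
    # Conservative scale (assumption); tighter scales apply in special cases.
    scale = 2.0 * sensitivity / epsilon
    noisy = [s + laplace_noise(scale, rng) for s in scores]
    # Only the winning index is released, never the noisy scores themselves.
    return max(range(len(noisy)), key=noisy.__getitem__)


if __name__ == "__main__":
    # Hypothetical similarity scores between an input token and three
    # candidate replacement tokens.
    candidate_scores = [0.1, 0.9, 0.3]
    print(report_noisy_max(candidate_scores, epsilon=1.0))
```

A smaller epsilon adds more noise, so the released index departs from the true maximum more often. This is the privacy and utility trade-off the abstract alludes to when it mentions a slight decrease in model performance.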

