CR | LG | May 21, 2025

An Efficient Private GPT Never Autoregressively Decodes

arXiv:2505.15252v1 | 1 citation | h-index: 29 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses privacy concerns for both clients and servers in GPT deployment by making secure inference more efficient. The advance is incremental because it builds on existing cryptographic methods.

The paper tackles the performance overhead of secure GPT inference by proposing a public decoding and secure verification approach, achieving a 2.1x to 6.0x speedup while maintaining privacy and generation quality.

The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens take similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve in two ways: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.
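
The propose-then-verify loop described in the abstract can be illustrated with a minimal plaintext sketch. This is not the paper's protocol: there is no cryptography here, the two "models" are hypothetical toy functions, and acceptance uses simple greedy matching rather than the private sampling protocol the abstract mentions. All names (`propose`, `verify`, `generate`, `toy_public`, `toy_private`, `k`) are assumptions introduced for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model: maps a token prefix to the next token (greedy).
Model = Callable[[Sequence[int]], int]


def propose(public_model: Model, prefix: List[int], k: int) -> List[int]:
    """Client side: decode k draft tokens autoregressively with the public model."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(public_model(prefix + draft))
    return draft


def verify(private_model: Model, prefix: List[int], draft: List[int]) -> List[int]:
    """Plaintext stand-in for secure verification (assumption: greedy matching).

    Accepts draft tokens while they match the private model's choice. At the
    first mismatch the private model's token is emitted instead. If every draft
    token matches, one extra private-model token is appended. A real system
    would score all positions in a single forward pass; the loop is for clarity.
    """
    accepted: List[int] = []
    for tok in draft:
        target = private_model(prefix + accepted)
        if tok != target:
            accepted.append(target)  # correct the first rejected position
            return accepted
        accepted.append(tok)
    accepted.append(private_model(prefix + accepted))  # bonus token
    return accepted


def generate(public_model: Model, private_model: Model, prompt: List[int],
             k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Run propose/verify rounds; return the tokens and the number of rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = propose(public_model, out, k)
        out.extend(verify(private_model, out, draft))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def toy_private(prefix: Sequence[int]) -> int:
    """Toy private model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_public(prefix: Sequence[int]) -> int:
    """Toy public model: agrees with the private one except after token 2."""
    return 9 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(toy_public, toy_private, [0], k=4, max_new_tokens=12)
    print(tokens, rounds)
```

In this sketch every round emits at least one token chosen by the private model, so the output matches what the private model alone would decode greedily, while the number of verification rounds shrinks as the public model agrees more often. That is the sense in which efficiency depends on the acceptance ratio.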

