CRAI Dec 12, 2025

A Scalable Multi-GPU Framework for Encrypted Large-Model Inference

arXiv:2512.11269v17 citations h-index: 4
Originality: Highly original
AI Analysis

This work addresses the problem of making encrypted AI practical and accessible for privacy-sensitive applications by enabling efficient FHE inference on large models using widely available GPUs, representing a substantial advance rather than an incremental improvement.

The paper tackles the slow performance and limited scalability of encrypted AI based on fully homomorphic encryption (FHE) by introducing Cerium, a multi-GPU framework. Cerium runs inference on small models up to 2.25x faster than hand-optimized GPU libraries and performs competitively with FHE ASICs. It also enables encrypted inference for large models, running BERT-Base in 8 seconds and Llama3-8B in 134 seconds.

Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees, but its slow performance has limited practical deployment. Recent works have proposed ASICs to accelerate FHE, but these require expensive, advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance using GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written, hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, outright matching the prior FHE ASIC CraterLake. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and is the first to demonstrate encrypted inference for BERT-Base and Llama3-8B, in 8 seconds and 134 seconds, respectively.

