Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference
This work addresses the communication bottleneck in privacy-preserving cloud-based AI services, offering a practical improvement for deploying models like ChatGPT securely, though it is incremental in optimizing existing frameworks.
The paper tackles the high communication cost in private Transformer inference by introducing Comet, a plug-in method that reduces communication by up to 3.9x and achieves 3.5x speedups while maintaining competitive performance on BERT and RoBERTa models with GLUE benchmarks.
The prevalent use of Transformer-like models, exemplified by ChatGPT, in modern language processing applications underscores the critical need for private inference, which is essential for many cloud-based services that rely on such models. However, current privacy-preserving frameworks impose a significant communication burden, especially for non-linear computation in Transformer models. In this paper, we introduce Comet, a novel plug-in method that effectively reduces the communication cost without compromising inference performance. We also introduce an efficient approximation method that eliminates the heavy communication involved in finding a good initial approximation. We evaluate Comet on BERT and RoBERTa models with the GLUE benchmark datasets, showing up to 3.9$\times$ less communication and 3.5$\times$ speedups while keeping competitive model performance compared to the prior art.