Lin Zhong

HC
h-index 57
8 papers
409 citations
Novelty 50%
AI Score 34

8 Papers

22.9 | CL | Nov 7, 2023 | Code
Prompt Cache: Modular Attention Reuse for Low-Latency Inference

In Gim, Guojun Chen, Seung-seob Lee et al.

We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
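As a rough illustration of the reuse idea described in this abstract, the toy sketch below caches a per-module "state" under a schema that fixes each module's start position, then assembles a prompt from cached modules plus freshly encoded user text. The names `encode_segment`, `ModuleCache`, and the schema layout are hypothetical stand-ins and not the Prompt Cache interface; real attention states would be key/value tensors produced by the model, not token lists.

```python
from dataclasses import dataclass, field


def encode_segment(text: str, start: int) -> list[tuple[int, str]]:
    """Hypothetical stand-in for computing attention states:
    pairs each whitespace token with its absolute position."""
    return [(start + i, tok) for i, tok in enumerate(text.split())]


@dataclass
class ModuleCache:
    # Schema (assumed layout): module name -> (start position, text).
    # Fixing the start position keeps cached states positionally valid.
    schema: dict[str, tuple[int, str]]
    store: dict[str, list[tuple[int, str]]] = field(default_factory=dict)
    computed: int = 0  # number of segments that had to be encoded

    def states_for(self, name: str) -> list[tuple[int, str]]:
        # Encode a module once, then serve it from the cache.
        if name not in self.store:
            start, text = self.schema[name]
            self.store[name] = encode_segment(text, start)
            self.computed += 1
        return self.store[name]

    def build_prompt(self, modules: list[str], question: str):
        states: list[tuple[int, str]] = []
        for name in modules:
            states.extend(self.states_for(name))
        # Uncached user text is placed after the last position in use.
        next_pos = max((pos for pos, _ in states), default=-1) + 1
        states.extend(encode_segment(question, next_pos))
        self.computed += 1
        return states
```

In this sketch, a second prompt that names the same modules only pays for encoding its new question; the module states come from the cache.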

7.3 | CR | Sep 27, 2024 | Code
Confidential Prompting: Privacy-preserving LLM Inference on Cloud

Caihua Li, In Gim, Lin Zhong

This paper introduces a vision of confidential prompting: securing user prompts from an untrusted, cloud-hosted large language model (LLM) while preserving model confidentiality, output invariance, and compute efficiency. As a first step toward this vision, we present Petridish, a system built on top of confidential computing and its core contribution, a novel technology called Secure Partitioned Decoding (SPD). Petridish runs the LLM service inside a confidential virtual machine (CVM), which protects the secrets, i.e., the LLM parameters and user prompts, from adversaries outside the CVM. Importantly, it splits the LLM service for a user into two processes, using SPD: a per-user process performs prefill with the user prompts and computes attention scores during decoding; a service process, shared by all users, batches the attention scores from per-user processes and generates output tokens for all users. Both the LLM provider and the users trust Petridish's CVM and its operating system, which guarantees isolation between processes and limits their outbound network capabilities to control information flow. The CVM's attestation capability and its open-source software stack enable Petridish to provide auditable protection of both user prompt and LLM confidentiality. Together, Petridish maintains full utility of LLM service and enables practical, privacy-preserving cloud-hosted LLM inference for sensitive applications, such as processing personal data, clinical records, and financial documents.

1.2 | PL | Jan 18, 2025
MappedTrace: Tracing Pointer Remotely with Compiler-generated Maps

Zhiyao Ma, Caihua Li, Lin Zhong

Existing precise pointer tracing methods introduce substantial runtime overhead to the program being traced and are applicable only at specific program execution points. We propose MappedTrace, which leverages compiler-generated read-only maps to accurately identify all pointers in any given snapshot of a program's execution state. The maps record the locations and types of pointers, allowing the tracer to precisely identify pointers without requiring the traced program to maintain bookkeeping data structures or poll at safe points, thereby reducing runtime overhead. By running the tracer from a different address space or machine, MappedTrace presents new opportunities to improve memory management techniques like memory leak detection and enables novel use cases such as infinite memory abstraction for resource-constrained environments.
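To make the map-based idea concrete, here is a minimal sketch under simplifying assumptions: a read-only table records which fields of each object type hold pointers, and a tracer walks a captured snapshot using only that table, with no help from the traced program. `POINTER_MAP`, the snapshot layout, and the function names are hypothetical; the maps described in the abstract would also cover pointer locations such as the stack and registers, whereas this sketch simply takes the roots as an argument.

```python
# Hypothetical read-only map, in the spirit of what a compiler might emit:
# type name -> indices of the fields that hold pointers.
POINTER_MAP = {
    "Node": (1,),    # Node = (value, next)
    "Pair": (0, 1),  # Pair = (left, right)
    "Blob": (),      # raw data, no pointers
}


def trace(snapshot: dict[int, tuple[str, tuple]], roots: list[int]) -> set[int]:
    """Return the addresses reachable from roots.

    snapshot maps address -> (type name, field values). Only fields that
    the map marks as pointers are followed, so integers that merely look
    like addresses are never mistaken for pointers.
    """
    reachable: set[int] = set()
    worklist = list(roots)
    while worklist:
        addr = worklist.pop()
        if addr in reachable or addr not in snapshot:
            continue
        reachable.add(addr)
        type_name, fields = snapshot[addr]
        for idx in POINTER_MAP[type_name]:
            target = fields[idx]
            if target is not None:
                worklist.append(target)
    return reachable


def leaked(snapshot: dict[int, tuple[str, tuple]], roots: list[int]) -> set[int]:
    """Objects present in the snapshot but unreachable from the roots."""
    return set(snapshot) - trace(snapshot, roots)
```

Because the tracer reads only the snapshot and the map, it could in principle run in another process or on another machine, which is the property the abstract highlights.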

3.7 | HC | May 26, 2021
POD: A Smartphone That Flies

Guojun Chen, Noah Weiner, Lin Zhong

We present POD, a smartphone that flies, as a new way to achieve hands-free, eyes-up mobile computing. Unlike existing drone-carried user interfaces, POD features a smartphone-sized display and the computing and sensing power of a modern smartphone. We share our experience in building a prototype of POD, discuss the technical challenges facing it, and describe early results toward addressing them.

14.0 | LG | Jun 8, 2020
Privacy Adversarial Network: Representation Learning for Mobile Data Privacy

Sicong Liu, Junzhao Du, Anshumali Shrivastava et al.

The remarkable success of machine learning has fostered a growing number of cloud-based intelligent services for mobile users. Such a service requires a user to send data, e.g., image, voice, and video, to the provider, which presents a serious challenge to user privacy. To address this, prior works either obfuscate the data, e.g., add noise and remove identity information, or send representations extracted from the data, e.g., anonymized features. They struggle to balance service utility and data privacy because obfuscated data reduces utility and extracted representations may still reveal sensitive information. This work departs from prior works in methodology: we leverage adversarial learning to achieve a better balance between privacy and utility. We design a representation encoder that generates feature representations optimized against the privacy disclosure risk of sensitive information (a measure of privacy) by the privacy adversaries, and concurrently optimized for task inference accuracy (a measure of utility) by the utility discriminator. The result is the privacy adversarial network (PAN), a novel deep model with a new training algorithm that can automatically learn representations from the raw data. Intuitively, PAN adversarially forces the extracted representations to convey only the information required by the target task. Surprisingly, this constitutes an implicit regularization that actually improves task accuracy. As a result, PAN achieves better utility and better privacy at the same time! We report extensive experiments on six popular datasets and demonstrate the superiority of PAN compared with alternative methods reported in prior work.

6.6 | LG | Jan 25, 2019
Better accuracy with quantified privacy: representations learned via reconstructive adversarial network

Sicong Liu, Anshumali Shrivastava, Junzhao Du et al.

The remarkable success of machine learning, especially deep learning, has produced a variety of cloud-based services for mobile users. Such services require an end user to send data to the service provider, which presents a serious challenge to end-user privacy. To address this concern, prior works either add noise to the data or send features extracted from the raw data. They struggle to balance utility and privacy because added noise reduces utility and raw data can be reconstructed from extracted features. This work represents a methodical departure from prior works: we balance between a measure of privacy and another of utility by leveraging adversarial learning to find a sweeter tradeoff. We design an encoder that optimizes against the reconstruction error (a measure of privacy), adversarially by a Decoder, and the inference accuracy (a measure of utility) by a Classifier. The result is RAN, a novel deep model with a new training algorithm that automatically extracts features for classification that are both private and useful. It turns out that adversarially forcing the extracted features to convey only the intended information required by classification acts as an implicit regularization, yielding better classification accuracy than the original model, which completely ignores privacy. Thus, we achieve better privacy with better utility, a surprising possibility in machine learning! We conducted extensive experiments on five popular datasets over four training schemes, and demonstrate the superiority of RAN compared with existing alternatives.

16.4 | HC | Mar 26, 2014
Draining our Glass: An Energy and Heat Characterization of Google Glass

Robert LiKamWa, Zhen Wang, Aaron Carroll et al.

The Google Glass is a mobile device designed to be worn as eyeglasses. This form factor enables new usage possibilities, such as hands-free video chats and instant web search. However, its shape also hampers its potential: (1) battery size, and therefore lifetime, is limited by a need for the device to be lightweight, and (2) high-power processing leads to significant heat, which should be limited, due to the Glass' compact form factor and close proximity to the user's skin. We use the Glass in a case study of the power and thermal characteristics of optical head-mounted display devices. We share insights and implications to limit power consumption to increase the safety and utility of head-mounted devices.

7.6 | HC | Sep 3, 2012
Practical Context Awareness: Measuring and Utilizing the Context Dependency of Mobile Usage

Ahmad Rahmati, Clayton Shepard, Chad Tossell et al.

Context information brings new opportunities for efficient and effective applications and services on mobile devices. A wide range of research has exploited context dependency, i.e., the relations between context(s) and the outcome, to achieve significant, quantified performance gains for a variety of applications. These works often have to deal with the challenge of multiple sources of context, which can lead to a sparse training data set, and the challenge of energy-hungry context sensors. Often, they address these challenges in an application-specific and ad hoc manner. We liberate mobile application designers and researchers from these burdens by providing a methodical approach to these challenges. In particular, we 1) define and measure the context dependency of three fundamental types of mobile usage in an application-agnostic yet practical manner, which can provide clear insight into the performance of potential applications; 2) address the challenge of data sparseness when dealing with multiple and different sources of context in a systematic manner; and 3) present SmartContext to address the energy challenge by automatically selecting among context sources while ensuring that the minimum accuracy for each estimation event is met. Our analysis and findings are based on usage and context traces collected in real-life settings from 24 iPhone users over a period of one year. We present findings regarding the context dependency of the three principal types of mobile usage: visited websites, phone calls, and app usage. Yet, our methodology and the lessons we learn can be readily extended to other context-dependent mobile usage and system resources as well. Our findings guide the development of context-aware systems, and highlight the challenges and expectations regarding the context dependency of mobile usage.
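The source-selection goal stated for SmartContext can be illustrated with a very small sketch: pick the cheapest context source whose expected accuracy meets the required minimum, and fall back to the most accurate source when none does. This is only one plausible reading of the abstract, not the paper's algorithm; the `Source` type, the `pick_source` function, and any source names or numbers used with them are made-up placeholders.

```python
from typing import NamedTuple


class Source(NamedTuple):
    name: str
    energy_mj: float  # illustrative energy cost per reading
    accuracy: float   # illustrative expected accuracy in [0, 1]


def pick_source(sources: list[Source], min_accuracy: float) -> Source:
    """Choose the lowest-energy source that meets min_accuracy.

    If no source meets the requirement, return the most accurate one
    (an assumption of this sketch).
    """
    eligible = [s for s in sources if s.accuracy >= min_accuracy]
    if eligible:
        return min(eligible, key=lambda s: s.energy_mj)
    return max(sources, key=lambda s: s.accuracy)
```

A caller would invoke `pick_source` once per estimation event, so that an expensive sensor is used only when cheaper sources cannot meet the accuracy requirement.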