AIJul 8, 2024Code
On Speeding Up Language Model EvaluationJin Peng Zhou, Christian K. Belardi, Ruihan Wu et al.
Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. This exhaustive evaluation can be time-consuming and costly. In this paper, we propose an $\textit{adaptive}$ approach to explore this space. We are exploiting the fact that often only few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We lean on multi-armed bandits to sequentially identify the next (method, validation sample)-pair to evaluate and utilize low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources -- resulting in 85-95% LLM cost savings. Our code is available at https://github.com/kilian-group/banditeval.
LGOct 19, 2022
Learning to Invert: Simple Adaptive Attacks for Gradient Inversion in Federated LearningRuihan Wu, Xiangyu Chen, Chuan Guo et al.
Gradient inversion attack enables recovery of training samples from model gradients in federated learning (FL), and constitutes a serious threat to data privacy. To mitigate this vulnerability, prior work proposed both principled defenses based on differential privacy, as well as heuristic defenses based on gradient compression as countermeasures. These defenses have so far been very effective, in particular those based on gradient compression that allow the model to maintain high accuracy while greatly reducing the effectiveness of attacks. In this work, we argue that such findings underestimate the privacy risk in FL. As a counterexample, we show that existing defenses can be broken by a simple adaptive attack, where a model trained on auxiliary data is able to invert gradients on both vision and language tasks.
CRJun 16, 2022
Differentially Private Multi-Party Data Release for Linear RegressionRuihan Wu, Xin Yang, Yuanshun Yao et al.
Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects. However the majority of prior work has focused on scenarios where a single party owns all the data. In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects. Within the context of linear regression that allow all parties to train models on the complete data without the ability to infer private attributes or identities of individuals, we start with directly applying Gaussian mechanism and show it has the small eigenvalue problem. We further propose our novel method and prove it asymptotically converges to the optimal (non-private) solutions with increasing dataset size. We substantiate the theoretical results through experiments on both artificial and real-world datasets.
LGSep 8, 2024Code
Influence-based Attributions can be ManipulatedChhavi Yadav, Ruihan Wu, Kamalika Chaudhuri
Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be \textit{systematically} tampered by an adversary. We show that this is indeed possible for logistic regression models trained on ResNet feature embeddings and standard tabular fairness datasets and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions in adversarial circumstances. Code is available at : \url{https://github.com/infinite-pursuits/influence-based-attributions-can-be-manipulated}
CVAug 4, 2023
Large-Scale Public Data Improves Differentially Private Image Generation QualityRuihan Wu, Chuan Guo, Kamalika Chaudhuri
Public data has been frequently used to improve the privacy-accuracy trade-off of differentially private machine learning, but prior work largely assumes that this data come from the same distribution as the private. In this work, we look at how to use generic large-scale public data to improve the quality of differentially private image generation in Generative Adversarial Networks (GANs), and provide an improved method that uses public data effectively. Our method works under the assumption that the support of the public data distribution contains the support of the private; an example of this is when the public data come from a general-purpose internet-scale image source, while the private data consist of images of a specific type. Detailed evaluations show that our method achieves SOTA in terms of FID score and other metrics compared with existing methods that use public data, and can generate high-quality, photo-realistic images in a differentially private manner.
52.3CLMay 7Code
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn DialogueXinjie Shen, Rongzhe Wei, Peizhi Niu et al.
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
67.8LGApr 17
DPrivBench: Benchmarking LLMs' Reasoning for Differential PrivacyErchi Wang, Pengrun Huang, Eli Chien et al.
Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
LGNov 10, 2025
Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data PrivateRuihan Wu, Erchi Wang, Zhiyuan Zhang et al.
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving documents from an external corpus at inference time. When this corpus contains sensitive information, however, unprotected RAG systems are at risk of leaking private information. Prior work has introduced differential privacy (DP) guarantees for RAG, but only in single-query settings, which fall short of realistic usage. In this paper, we study the more practical multi-query setting and propose two DP-RAG algorithms. The first, MURAG, leverages an individual privacy filter so that the accumulated privacy loss only depends on how frequently each document is retrieved rather than the total number of queries. The second, MURAG-ADA, further improves utility by privately releasing query-specific thresholds, enabling more precise selection of relevant documents. Our experiments across multiple LLMs and datasets demonstrate that the proposed methods scale to hundreds of queries within a practical DP budget ($\varepsilon\approx10$), while preserving meaningful utility.
CRDec 6, 2024
Privacy-Preserving Retrieval-Augmented Generation with Differential PrivacyTatsuki Koga, Ruihan Wu, Zhiyuan Zhang et al.
With the recent remarkable advancement of large language models (LLMs), there has been a growing interest in utilizing them in the domains with highly sensitive data that lies outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective -- it assists LLMs by directly providing relevant information from the external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is how to generate long accurate answers within a moderate privacy budget. We address this by proposing an algorithm that smartly spends privacy budget only for the tokens that require the sensitive information and uses the non-private LLM for other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $ε\approx 10$ across different models and datasets.
CLOct 19, 2024
Evaluating Deep Unlearning in Large Language ModelsRuihan Wu, Chhavi Yadav, Russ Salakhutdinov et al.
Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on removing a specified target fact robustly, but often overlooks its deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced via retained knowledge in the LLM and logical reasoning. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the remainder model utility. To benchmark this setting, we leverage both (1) an existing real-world knowledge dataset, MQuAKE, that provides one-step deduction instances, and (2) newly construct a novel semi-synthetic dataset, Eval-DU, that allows multiple steps of realistic deductions among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn, or excessively remove unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust/deep fact unlearning in LLMs.
LGFeb 5, 2024
Online Feature Updates Improve Online (Generalized) Label Shift AdaptationRuihan Wu, Siddhartha Datta, Yi Su et al.
This paper addresses the prevalent issue of label shift in an online setting with missing labels, where data distributions change over time and obtaining timely labels is challenging. While existing methods primarily focus on adjusting or updating the final layer of a pre-trained classifier, we explore the untapped potential of enhancing feature representations using unlabeled data at test-time. Our novel method, Online Label Shift adaptation with Online Feature Updates (OLS-OFU), leverages self-supervised learning to refine the feature extraction process, thereby improving the prediction model. By carefully designing the algorithm, theoretically OLS-OFU maintains the similar online regret convergence to the results in the literature while taking the improved features into account. Empirically, it achieves substantial improvements over existing methods, which is as significant as the gains existing methods have over the baseline (i.e., without distribution shift adaptations).
LGJun 12, 2025
Can We Infer Confidential Properties of Training Data from LLMs?Pengrun Huang, Chhavi Yadav, Kamalika Chaudhuri et al.
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
CLJun 18, 2025
Learning-Time Encoding Shapes Unlearning in LLMsRuihan Wu, Konstantin Garov, Kamalika Chaudhuri
As large language models (LLMs) are increasingly deployed in the real world, the ability to ``unlearn'', or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
LGFeb 16, 2025
Improved Regret in Stochastic Decision-Theoretic Online Learning under Differential PrivacyRuihan Wu, Yu-Xiang Wang
Hu and Mehta (2024) posed an open problem: what is the optimal instance-dependent rate for the stochastic decision-theoretic online learning (with $K$ actions and $T$ rounds) under $\varepsilon$-differential privacy? Before, the best known upper bound and lower bound are $O\left(\frac{\log K}{Δ_{\min}} + \frac{\log K\log T}{\varepsilon}\right)$ and $Ω\left(\frac{\log K}{Δ_{\min}} + \frac{\log K}{\varepsilon}\right)$ (where $Δ_{\min}$ is the gap between the optimal and the second actions). In this paper, we partially address this open problem by having two new results. First, we provide an improved upper bound for this problem $O\left(\frac{\log K}{Δ_{\min}} + \frac{\log^2K}{\varepsilon}\right)$, which is $T$-independent and only has a log dependency in $K$. Second, to further understand the gap, we introduce the \textit{deterministic setting}, a weaker setting of this open problem, where the received loss vector is deterministic. At this weaker setting, a direct application of the analysis and algorithms from the original setting still leads to an extra log factor. We conduct a novel analysis which proves upper and lower bounds that match at $Θ(\frac{\log K}{\varepsilon})$.
LGFeb 25, 2022
Does Label Differential Privacy Prevent Label Inference Attacks?Ruihan Wu, Jin Peng Zhou, Kilian Q. Weinberger et al.
Label differential privacy (label-DP) is a popular framework for training private ML models on datasets with public features and sensitive private labels. Despite its rigorous privacy guarantee, it has been observed that in practice label-DP does not preclude label inference attacks (LIAs): Models trained with label-DP can be evaluated on the public training features to recover, with high accuracy, the very private labels that it was designed to protect. In this work, we argue that this phenomenon is not paradoxical and that label-DP is designed to limit the advantage of an LIA adversary compared to predicting training labels using the Bayes classifier. At label-DP $ε=0$ this advantage is zero, hence the optimal attack is to predict according to the Bayes classifier and is independent of the training labels. Our bound shows the semantic protection conferred by label-DP and gives guidelines on how to choose $\varepsilon$ to limit the threat of LIAs below a certain level. Finally, we empirically demonstrate that our result closely captures the behavior of simulated attacks on both synthetic and real world datasets.
LGJul 9, 2021
Online Adaptation to Label Distribution ShiftRuihan Wu, Chuan Guo, Yi Su et al.
Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true label. Leveraging a novel analysis, we show that the lack of true label does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.
LGMar 22, 2021
Fixes That Fail: Self-Defeating Improvements in Machine-Learning SystemsRuihan Wu, Chuan Guo, Awni Hannun et al.
Machine-learning systems such as self-driving cars or virtual assistants are composed of a large number of machine-learning models that recognize image content, transcribe speech, analyze natural language, infer preferences, rank options, etc. Models in these systems are often developed and trained independently, which raises an obvious concern: Can improving a machine-learning model make the overall system worse? We answer this question affirmatively by showing that improving a model can deteriorate the performance of downstream models, even after those downstream models are retrained. Such self-defeating improvements are the result of entanglement between the models in the system. We perform an error decomposition of systems with multiple machine-learning models, which sheds light on the types of errors that can lead to self-defeating improvements. We also present the results of experiments which show that self-defeating improvements emerge in a realistic stereo-based detection system for cars and pedestrians.
CRFeb 9, 2021
Making Paper Reviewing Robust to Bid Manipulation AttacksRuihan Wu, Chuan Guo, Felix Wu et al.
Most computer science conferences rely on paper bidding to assign reviewers to papers. Although paper bidding enables high-quality assignments in days of unprecedented submission numbers, it also opens the door for dishonest reviewers to adversarially influence paper reviewing assignments. Anecdotal evidence suggests that some reviewers bid on papers by "friends" or colluding authors, even though these papers are outside their area of expertise, and recommend them for acceptance without considering the merit of the work. In this paper, we study the efficacy of such bid manipulation attacks and find that, indeed, they can jeopardize the integrity of the review process. We develop a novel approach for paper bidding and assignment that is much more robust against such attacks. We show empirically that our approach provides robustness even when dishonest reviewers collude, have full knowledge of the assignment system's internal workings, and have access to the system's inputs. In addition to being more robust, the quality of our paper review assignments is comparable to that of current, non-robust assignment approaches.
STR-ELNov 6, 2020
Correlator Convolutional Neural Networks: An Interpretable Architecture for Image-like Quantum Matter DataCole Miles, Annabelle Bohrdt, Ruihan Wu et al.
Machine learning models are a powerful theoretical tool for analyzing data from quantum simulators, in which results of experiments are sets of snapshots of many-body states. Recently, they have been successfully applied to distinguish between snapshots that can not be identified using traditional one and two point correlation functions. Thus far, the complexity of these models has inhibited new physical insights from this approach. Here, using a novel set of nonlinearities we develop a network architecture that discovers features in the data which are directly interpretable in terms of physical observables. In particular, our network can be understood as uncovering high-order correlators which significantly differ between the data studied. We demonstrate this new architecture on sets of simulated snapshots produced by two candidate theories approximating the doped Fermi-Hubbard model, which is realized in state-of-the art quantum gas microscopy experiments. From the trained networks, we uncover that the key distinguishing features are fourth-order spin-charge correlators, providing a means to compare experimental data to theoretical predictions. Our approach lends itself well to the construction of simple, end-to-end interpretable architectures and is applicable to arbitrary lattice data, thus paving the way for new physical insights from machine learning studies of experimental as well as numerical data.
LGFeb 24, 2020
On Hiding Neural Networks Inside Neural NetworksChuan Guo, Ruihan Wu, Kilian Q. Weinberger
Modern neural networks often contain significantly more parameters than the size of their training data. We show that this excess capacity provides an opportunity for embedding secret machine learning models within a trained neural network. Our novel framework hides the existence of a secret neural network with arbitrary desired functionality within a carrier network. We prove theoretically that the secret network's detection is computationally infeasible and demonstrate empirically that the carrier network does not compromise the secret network's disguise. Our paper introduces a previously unknown steganographic technique that can be exploited by adversaries if left unchecked.
LGFeb 24, 2018
Product Kernel Interpolation for Scalable Gaussian ProcessesJacob R. Gardner, Geoff Pleiss, Ruihan Wu et al.
Recent work shows that inference for Gaussian processes can be performed efficiently using iterative methods that rely only on matrix-vector multiplications (MVMs). Structured Kernel Interpolation (SKI) exploits these techniques by deriving approximate kernels with very fast MVMs. Unfortunately, such strategies suffer badly from the curse of dimensionality. We develop a new technique for MVM based learning that exploits product kernel structure. We demonstrate that this technique is broadly applicable, resulting in linear rather than exponential runtime with dimension for SKI, as well as state-of-the-art asymptotic complexity for multi-task GPs.
LGFeb 18, 2017
Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC ClassesLunjia Hu, Ruihan Wu, Tianhong Li et al.
In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. The RTD of a concept class $\mathcal C \subseteq \{0, 1\}^n$, introduced by Zilles et al. (2011), is a combinatorial complexity measure characterized by the worst-case number of examples necessary to identify a concept in $\mathcal C$ according to the recursive teaching model. For any finite concept class $\mathcal C \subseteq \{0,1\}^n$ with $\mathrm{VCD}(\mathcal C)=d$, Simon & Zilles (2015) posed an open problem $\mathrm{RTD}(\mathcal C) = O(d)$, i.e., is RTD linearly upper bounded by VCD? Previously, the best known result is an exponential upper bound $\mathrm{RTD}(\mathcal C) = O(d \cdot 2^d)$, due to Chen et al. (2016). In this paper, we show a quadratic upper bound: $\mathrm{RTD}(\mathcal C) = O(d^2)$, much closer to an answer to the open problem. We also discuss the challenges in fully solving the problem.