Ninghui Li

h-index14

15papers

586citations

Novelty57%

AI Score51

Ranked #17,487 of 194,257 authors (top 9%)#303 in CR (top 4%)

15 Papers

15.3CRAug 2, 2022

Differentially Private Vertical Federated Clustering

Zitao Li, Tianhao Wang, Ninghui Li

In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, light-weight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weights estimation algorithm and a parameter tuning strategy to reduce the final k-means utility to be close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniques.

13.5CRApr 14Code

Automated Profile Inference with Language Model Agents

Yuntao Du, Zitao Li, Bolin Ding et al.

Impressive progress has been made in automated problem-solving by the collaboration of large language model (LLM) based agents. However, these automated capabilities also open avenues for malicious applications. In this paper, we study a new threat that LLMs pose to online pseudonymity, called automated profile inference, where an adversary can instruct LLMs to automatically collect and extract sensitive personal attributes from publicly available user activities on pseudonymous platforms. We also introduce an automated profiling framework called AutoProfiler to demonstrate and assess the feasibility of such attacks in real-world scenarios. AutoProfiler consists of four specialized LLM agents that work collaboratively to retrieve and process user online activities and generate a profile with extracted personal information. Experimental results on two real-world datasets and one synthetic dataset show that AutoProfiler is highly effective and efficient, and the inferred attributes are both identifiable and sensitive, posing significant privacy risks. We explore mitigation strategies from different perspectives and advocate for increased public awareness of this emerging privacy threat.

13.1CRNov 2, 2023

MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training

Jiacheng Li, Ninghui Li, Bruno Ribeiro

In Member Inference (MI) attacks, the adversary try to determine whether an instance is used to train a machine learning (ML) model. MI attacks are a major privacy concern when using private data to train ML models. Most MI attacks in the literature take advantage of the fact that ML models are trained to fit the training data well, and thus have very low loss on training instances. Most defenses against MI attacks therefore try to make the model fit the training data less well. Doing so, however, generally results in lower accuracy. We observe that training instances have different degrees of vulnerability to MI attacks. Most instances will have low loss even when not included in training. For these instances, the model can fit them well without concerns of MI attacks. An effective defense only needs to (possibly implicitly) identify instances that are vulnerable to MI attacks and avoids overfitting them. A major challenge is how to achieve such an effect in an efficient training process. Leveraging two distinct recent advancements in representation learning: counterfactually-invariant representations and subspace learning methods, we introduce a novel Membership-Invariant Subspace Training (MIST) method to defend against MI attacks. MIST avoids overfitting the vulnerable instances without significant impact on other instances. We have conducted extensive experimental studies, comparing MIST with various other state-of-the-art (SOTA) MI defenses against several SOTA MI attacks. We find that MIST outperforms other defenses while resulting in minimal reduction in testing accuracy.

13.0LGJan 27, 2025Code

CENSOR: Defense Against Gradient Inversion via Orthogonal Subspace Bayesian Sampling

Kaiyuan Zhang, Siyuan Cheng, Guangyu Shen et al.

Federated learning collaboratively trains a neural network on a global server, where each local client receives the current global model weights and sends back parameter updates (gradients) based on its local private data. The process of sending these model updates may leak client's private data information. Existing gradient inversion attacks can exploit this vulnerability to recover private training instances from a client's gradient vectors. Recently, researchers have proposed advanced gradient inversion techniques that existing defenses struggle to handle effectively. In this work, we present a novel defense tailored for large neural network models. Our defense capitalizes on the high dimensionality of the model parameters to perturb gradients within a subspace orthogonal to the original gradient. By leveraging cold posteriors over orthogonal subspaces, our defense implements a refined gradient update mechanism. This enables the selection of an optimal gradient that not only safeguards against gradient inversion attacks but also maintains model utility. We conduct comprehensive experiments across three different datasets and evaluate our defense against various state-of-the-art attacks and defenses. Code is available at https://censor-gradient.github.io.

3.6CROct 7, 2025

Membership Inference Attacks on Tokenizers of Large Language Models

Meng Tong, Yuntao Du, Kejiang Chen et al.

Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when these attacks are applied to pre-trained large language models (LLMs), they encounter significant challenges, including mislabeled samples, distribution shifts, and discrepancies in model size between experimental and real-world settings. To address these limitations, we introduce tokenizers as a new attack vector for membership inference. Specifically, a tokenizer converts raw text into tokens for LLMs. Unlike full models, tokenizers can be efficiently trained from scratch, thereby avoiding the aforementioned challenges. In addition, the tokenizer's training data is typically representative of the data used to pre-train LLMs. Despite these advantages, the potential of tokenizers as an attack vector remains unexplored. To this end, we present the first study on membership leakage through tokenizers and explore five attack methods to infer dataset membership. Extensive experiments on millions of Internet samples reveal the vulnerabilities in the tokenizers of state-of-the-art LLMs. To mitigate this emerging risk, we further propose an adaptive defense. Our findings highlight tokenizers as an overlooked yet critical privacy threat, underscoring the urgent need for privacy-preserving mechanisms specifically designed for them.

3.6CRSep 8, 2025

Imitative Membership Inference Attack

Yuntao Du, Yuetian Chen, Hanshen Xiao et al.

A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.

17.0CRJun 24, 2021

DPSyn: Experiences in the NIST Differential Privacy Data Synthesis Challenges

Ninghui Li, Zhikun Zhang, Tianhao Wang

We summarize the experience of participating in two differential privacy competitions organized by the National Institute of Standards and Technology (NIST). In this paper, we document our experiences in the competition, the approaches we have used, the lessons we have learned, and our call to the research community to further bridge the gap between theory and practice in DP research.

28.3CRDec 30, 2020

PrivSyn: Differentially Private Data Synthesis

Zhikun Zhang, Tianhao Wang, Ninghui Li et al.

In differential privacy (DP), a challenging problem is to generate synthetic datasets that efficiently capture the useful information in the private data. The synthetic dataset enables any task to be done without privacy concern and modification to existing algorithms. In this paper, we present PrivSyn, the first automatic synthetic data generation method that can handle general tabular datasets (with 100 attributes and domain size $>2^{500}$). PrivSyn is composed of a new method to automatically and privately identify correlations in the data, and a novel method to generate sample data from a dense graphic model. We extensively evaluate different methods on multiple datasets to demonstrate the performance of our method.

17.0CRSep 14, 2020Code

Answering Multi-Dimensional Range Queries under Local Differential Privacy

Jianyu Yang, Tianhao Wang, Ninghui Li et al.

In this paper, we tackle the problem of answering multi-dimensional range queries under local differential privacy. There are three key technical challenges: capturing the correlations among attributes, avoiding the curse of dimensionality, and dealing with the large domains of attributes. None of the existing approaches satisfactorily deals with all three challenges. Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noises, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. To correct this deficiency, we further propose Hybrid-Dimensional Grids (HDG), which also introduces 1-D grids to capture finer-grained information on distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. To make HDG consistently effective, we provide a guideline for properly choosing granularities of grids based on an analysis of how different sources of errors are impacted by these choices. Extensive experiments conducted on real and synthetic datasets show that HDG can give a significant improvement over the existing approaches.

24.2CRDec 2, 2019

Estimating Numerical Distributions under Local Differential Privacy

Zitao Li, Tianhao Wang, Milan Lopuhaä-Zwakenberg et al.

When collecting information, local differential privacy (LDP) relieves the concern of privacy leakage from users' perspective, as user's private information is randomized before sent to the aggregator. We study the problem of recovering the distribution over a numerical domain while satisfying LDP. While one can discretize a numerical domain and then apply the protocols developed for categorical domains, we show that taking advantage of the numerical nature of the domain results in better trade-off of privacy and utility. We introduce a new reporting mechanism, called the square wave SW mechanism, which exploits the numerical nature in reporting. We also develop an Expectation Maximization with Smoothing (EMS) algorithm, which is applied to aggregated histograms from the SW mechanism to estimate the original distributions. Extensive experiments demonstrate that our proposed approach, SW with EMS, consistently outperforms other methods in a variety of utility metrics.

4.9CRNov 24, 2019

Improving Frequency Estimation under Local Differential Privacy

Milan Lopuhaä-Zwakenberg, Zitao Li, Boris Škorić et al.

Local Differential Privacy protocols are stochastic protocols used in data aggregation when individual users do not trust the data aggregator with their private data. In such protocols there is a fundamental tradeoff between user privacy and aggregator utility. In the setting of frequency estimation, established bounds on this tradeoff are either nonquantitative, or far from what is known to be attainable. In this paper, we use information-theoretical methods to significantly improve established bounds. We also show that the new bounds are attainable for binary inputs. Furthermore, our methods lead to improved frequency estimators, which we experimentally show to outperform state-of-the-art methods.

14.0CROct 17, 2019

Information-theoretic metrics for Local Differential Privacy protocols

Milan Lopuhaä-Zwakenberg, Boris Škorić, Ninghui Li

Local Differential Privacy (LDP) protocols allow an aggregator to obtain population statistics about sensitive data of a userbase, while protecting the privacy of the individual users. To understand the tradeoff between aggregator utility and user privacy, we introduce new information-theoretic metrics for utility and privacy. Contrary to other LDP metrics, these metrics highlight the fact that the users and the aggregator are interested in fundamentally different domains of information. We show how our metrics relate to $\varepsilon$-LDP, the \emph{de facto} standard privacy metric, giving an information-theoretic interpretation to the latter. Furthermore, we use our metrics to quantitatively study the privacy-utility tradeoff for a number of popular protocols.

24.9CRMay 20, 2019

Locally Differentially Private Frequency Estimation with Consistency

Tianhao Wang, Milan Lopuhaä-Zwakenberg, Zitao Li et al.

Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in the industry. A basic building block is frequency oracle (FO) protocols, which estimate frequencies of values. While several FO protocols have been proposed, the design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols by exploiting the knowledge that all individual frequencies should be non-negative and they sum up to one can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conducted extensive experimental evaluations to understand which methods should be used for different query tasks.

34.7CRMar 5, 2016

Understanding the Sparse Vector Technique for Differential Privacy

Min Lyu, Dong Su, Ninghui Li

The Sparse Vector Technique (SVT) is a fundamental technique for satisfying differential privacy and has the unique quality that one can output some query answers without apparently paying any privacy cost. SVT has been used in both the interactive setting, where one tries to answer a sequence of queries that are not known ahead of the time, and in the non-interactive setting, where all queries are known. Because of the potential savings on privacy budget, many variants for SVT have been proposed and employed in privacy-preserving data mining and publishing. However, most variants of SVT are actually not private. In this paper, we analyze these errors and identify the misunderstandings that likely contribute to them. We also propose a new version of SVT that provides better utility, and introduce an effective technique to improve the performance of SVT. These enhancements can be applied to improve utility in the interactive setting. Through both analytical and experimental comparisons, we show that, in the non-interactive setting (but not the interactive setting), the SVT technique is unnecessary, as it can be replaced by the Exponential Mechanism (EM) with better accuracy.

10.9CRApr 22, 2015

Differentially Private Projected Histograms of Multi-Attribute Data for Classification

Dong Su, Jianneng Cao, Ninghui Li

In this paper, we tackle the problem of constructing a differentially private synopsis for the classification analyses. Several the state-of-the-art methods follow the structure of existing classification algorithms and are all iterative, which is suboptimal due to the locally optimal choices and the over-divided privacy budget among many sequentially composed steps. Instead, we propose a new approach, PrivPfC, a new differentially private method for releasing data for classification. The key idea is to privately select an optimal partition of the underlying dataset using the given privacy budget in one step. Given one dataset and the privacy budget, PrivPfC constructs a pool of candidate grids where the number of cells of each grid is under a data-aware and privacy-budget-aware threshold. After that, PrivPfC selects an optimal grid via the exponential mechanism by using a novel quality function which minimizes the expected number of misclassified records on which a histogram classifier is constructed using the published grid. Finally, PrivPfC injects noise into each cell of the selected grid and releases the noisy grid as the private synopsis of the data. If the size of the candidate grid pool is larger than the processing capability threshold set by the data curator, we add a step in the beginning of PrivPfC to prune the set of attributes privately. We introduce a modified $χ^2$ quality function with low sensitivity and use it to evaluate an attribute's relevance to the classification label variable. Through extensive experiments on real datasets, we demonstrate PrivPfC's superiority over the state-of-the-art methods.