Harsh Chaudhari

LG
h-index52
19papers
601citations
Novelty55%
AI Score58

19 Papers

LGAug 25, 2022Code
SNAP: Efficient Extraction of Private Properties with Poisoning

Harsh Chaudhari, John Abascal, Alina Oprea et al. · eth-zurich

Property inference attacks allow an adversary to extract global properties of the training dataset from a machine learning model. Such attacks have privacy implications for data owners sharing their datasets to train machine learning models. Several existing approaches for property inference attacks against deep neural networks have been proposed, but they all rely on the attacker training a large number of shadow models, which induces a large computational overhead. In this paper, we consider the setting of property inference attacks in which the attacker can poison a subset of the training dataset and query the trained target model. Motivated by our theoretical analysis of model confidences under poisoning, we design an efficient property inference attack, SNAP, which obtains higher attack success and requires lower amounts of poisoning than the state-of-the-art poisoning-based property inference attack by Mahloujifar et al. For example, on the Census dataset, SNAP achieves 34% higher success rate than Mahloujifar et al. while being 56.5x faster. We also extend our attack to infer whether a certain property was present at all during training and estimate the exact proportion of a property of interest efficiently. We evaluate our attack on several properties of varying proportions from four datasets and demonstrate SNAP's generality and effectiveness. An open-source implementation of SNAP can be found at https://github.com/johnmath/snap-sp23.

CRMay 20, 2022
SafeNet: The Unreasonable Effectiveness of Ensembles in Private Collaborative Learning

Harsh Chaudhari, Matthew Jagielski, Alina Oprea

Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, by design, MPC protocols faithfully compute the training functionality, which the adversarial ML community has shown to leak private information and can be tampered with in poisoning attacks. In this work, we argue that model ensembles, implemented in our framework called SafeNet, are a highly MPC-amenable way to avoid many adversarial ML attacks. The natural partitioning of data amongst owners in MPC training allows this approach to be highly scalable at training time, provide provable protection from poisoning attacks, and provably defense against a number of privacy attacks. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models trained in end-to-end and transfer learning scenarios. For instance, SafeNet reduces backdoor attack success significantly, while achieving $39\times$ faster training and $36 \times$ less communication than the four-party MPC framework of Dalskov et al. Our experiments show that ensembling retains these benefits even in many non-iid settings. The simplicity, cheap setup, and robustness properties of ensembling make it a strong first choice for training ML models privately in MPC.

CRMay 22
PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

Luze Sun, Anshuman Suri, Harsh Chaudhari et al.

When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.

CRSep 23, 2024
UTrace: Poisoning Forensics for Private Collaborative Learning

Evan Rose, Hidde Lycklama, Harsh Chaudhari et al.

Privacy-preserving machine learning (PPML) systems enable multiple data owners to collaboratively train models without revealing their raw, sensitive data by leveraging cryptographic protocols such as secure multi-party computation (MPC). While PPML offers strong privacy guarantees, it also introduces new attack surfaces: malicious data owners can inject poisoned data into the training process without being detected, thus undermining the integrity of the learned model. Although recent defenses, such as private input validation within MPC, can mitigate some specific poisoning strategies, they remain insufficient, particularly in preventing stealthy or distributed attacks. As the robustness of PPML remains an open challenge, strengthening trust in these systems increasingly necessitates post-hoc auditing mechanisms that instill accountability. In this paper we present UTrace, a framework for user-level traceback in PPML that attributes integrity failures to responsible data owners without compromising the privacy guarantees of MPC. UTrace encapsulates two mechanisms: a gradient similarity method that identifies suspicious update patterns linked to poisoning, and a user-level unlearning technique that quantifies each user's marginal influence on model behavior. Together, these methods allow UTrace to attribute model misbehavior to specific users with high precision. We implement UTrace within an MPC-compatible training and auditing pipeline and evaluate its effectiveness on four datasets spanning vision, text, and malware. Across ten canonical poisoning attacks, UTrace consistently achieves high detection accuracy with low false positive rates.

LGOct 5, 2023
Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Harsh Chaudhari, Giorgio Severi, Alina Oprea et al.

The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.

CVJan 14
Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh, Yuefeng Peng, Anshuman Suri et al.

Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.

CRJan 27
Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al.

Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models' capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models using CoT datasets from public repositories like HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior works have shown the possibility of mounting backdoor attacks in CoT-based models, these attacks require explicit inclusion of triggered queries with flawed reasoning and incorrect answers in the training set to succeed. Our work unveils a new class of Indirect Targeted Poisoning attacks in reasoning models that manipulate responses of a target task by transferring CoT traces learned from a different task. Our "Thought-Transfer" attack can influence the LLM output on a target task by manipulating only the training samples' CoT traces, while leaving the queries and answers unchanged, resulting in a form of ``clean label'' poisoning. Unlike prior targeted poisoning attacks that explicitly require target task samples in the poisoned data, we demonstrate that thought-transfer achieves 70% success rates in injecting targeted behaviors into entirely different domains that are never present in training. Training on poisoned reasoning data also improves the model's performance by 10-15% on multiple benchmarks, providing incentives for a user to use our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models, which is not easily defended by existing mitigations.

CLDec 30, 2023Code
L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar et al.

This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP

LGOct 25, 2024
Measuring memorization in language models via probabilistic extraction

Jamie Hayes, Marika Swanberg, Harsh Chaudhari et al. · deepmind

Large language models (LLMs) are susceptible to memorizing training data, raising concerns about the potential extraction of sensitive information at generation time. Discoverable extraction is the most common method for measuring this issue: split a training example into a prefix and suffix, then prompt the LLM with the prefix, and deem the example extractable if the LLM generates the matching suffix using greedy sampling. This definition yields a yes-or-no determination of whether extraction was successful with respect to a single query. Though efficient to compute, we show that this definition is unreliable because it does not account for non-determinism present in more realistic (non-greedy) sampling schemes, for which LLMs produce a range of outputs for the same prompt. We introduce probabilistic discoverable extraction, which, without additional cost, relaxes discoverable extraction by considering multiple queries to quantify the probability of extracting a target sequence. We evaluate our probabilistic measure across different models, sampling schemes, and training-data repetitions, and find that this measure provides more nuanced information about extraction risk compared to traditional discoverable extraction.

CRFeb 1, 2025
Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation

Ali Naseh, Yuefeng Peng, Anshuman Suri et al.

Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model's context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document's presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.

LGOct 10, 2025
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · eth-zurich

How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.

LGMay 30, 2025
Cascading Adversarial Bias from Injection to Distillation in Language Models

Harsh Chaudhari, Jamie Hayes, Matthew Jagielski et al. · deepmind

Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.

CRSep 6, 2025
Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Hanna Foerster, Ilia Shumailov, Yiren Zhao et al. · deepmind

Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.

LGJul 11, 2025
Exploiting Leaderboards for Large-Scale Distribution of Malicious Models

Anshuman Suri, Harsh Chaudhari, Yuefeng Peng et al.

While poisoning attacks on machine learning models have been extensively studied, the mechanisms by which adversaries can distribute poisoned models at scale remain largely unexplored. In this paper, we shed light on how model leaderboards -- ranked platforms for model discovery and evaluation -- can serve as a powerful channel for adversaries for stealthy large-scale distribution of poisoned models. We present TrojanClimb, a general framework that enables injection of malicious behaviors while maintaining competitive leaderboard performance. We demonstrate its effectiveness across four diverse modalities: text-embedding, text-generation, text-to-speech and text-to-image, showing that adversaries can successfully achieve high leaderboard rankings while embedding arbitrary harmful functionalities, from backdoors to bias injection. Our findings reveal a significant vulnerability in the machine learning ecosystem, highlighting the urgent need to redesign leaderboard evaluation mechanisms to detect and filter malicious (e.g., poisoned) models, while exposing broader security implications for the machine learning community regarding the risks of adopting models from unverified sources.

CLMay 19, 2025
R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model

Ali Naseh, Harsh Chaudhari, Jaechul Roh et al.

DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI's o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.

LGOct 7, 2025
Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh, Anshuman Suri, Yuefeng Peng et al.

Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.

CLDec 3, 2023
On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi

Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar et al.

Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low resource languages. In this work, we focus on NER for low-resource language and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN, and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using sub-word tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.

LGDec 5, 2019
Trident: Efficient 4PC Framework for Privacy Preserving Machine Learning

Harsh Chaudhari, Rahul Rachuri, Ajith Suresh

Machine learning has started to be deployed in fields such as healthcare and finance, which propelled the need for and growth of privacy-preserving machine learning (PPML). We propose an actively secure four-party protocol (4PC), and a framework for PPML, showcasing its applications on four of the most widely-known machine learning algorithms -- Linear Regression, Logistic Regression, Neural Networks, and Convolutional Neural Networks. Our 4PC protocol tolerating at most one malicious corruption is practically efficient as compared to the existing works. We use the protocol to build an efficient mixed-world framework (Trident) to switch between the Arithmetic, Boolean, and Garbled worlds. Our framework operates in the offline-online paradigm over rings and is instantiated in an outsourced setting for machine learning. Also, we propose conversions especially relevant to privacy-preserving machine learning. The highlights of our framework include using a minimal number of expensive circuits overall as compared to ABY3. This can be seen in our technique for truncation, which does not affect the online cost of multiplication and removes the need for any circuits in the offline phase. Our B2A conversion has an improvement of $\mathbf{7} \times$ in rounds and $\mathbf{18} \times$ in the communication complexity. The practicality of our framework is argued through improvements in the benchmarking of the aforementioned algorithms when compared with ABY3. All the protocols are implemented over a 64-bit ring in both LAN and WAN settings. Our improvements go up to $\mathbf{187} \times$ for the training phase and $\mathbf{158} \times$ for the prediction phase when observed over LAN and WAN.

CRDec 5, 2019
ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Harsh Chaudhari, Ashish Choudhury, Arpita Patra et al.

The concrete efficiency of secure computation has been the focus of many recent works. In this work, we present concretely-efficient protocols for secure $3$-party computation (3PC) over a ring of integers modulo $2^{\ell}$ tolerating one corruption, both with semi-honest and malicious security. Owing to the fact that computation over ring emulates computation over the real-world system architectures, secure computation over ring has gained momentum of late. Cast in the offline-online paradigm, our constructions present the most efficient online phase in concrete terms. In the semi-honest setting, our protocol requires communication of $2$ ring elements per multiplication gate during the {\it online} phase, attaining a per-party cost of {\em less than one element}. This is achieved for the first time in the regime of 3PC. In the {\it malicious} setting, our protocol requires communication of $4$ elements per multiplication gate during the online phase, beating the state-of-the-art protocol by $5$ elements. Realized with both the security notions of selective abort and fairness, the malicious protocol with fairness involves slightly more communication than its counterpart with abort security for the output gates {\em alone}. We apply our techniques from $3$PC in the regime of secure server-aided machine-learning (ML) inference for a range of prediction functions-- linear regression, linear SVM regression, logistic regression, and linear SVM classification. Our setting considers a model-owner with trained model parameters and a client with a query, with the latter willing to learn the prediction of her query based on the model parameters of the former. The inputs and computation are outsourced to a set of three non-colluding servers. Our constructions catering to both semi-honest and the malicious world, invariably perform better than the existing constructions.