Shakila Mahjabin Tonni

LG
3 papers
148 citations
Novelty: 53%
AI Score: 29

3 Papers

LG · Sep 19, 2023
What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples

Shakila Mahjabin Tonni, Mark Dras

Adversarial examples, deliberately crafted using small perturbations to fool deep neural networks, were first studied in image processing and more recently in NLP. While approaches to detecting adversarial examples in NLP have largely relied on search over input perturbations, image processing has seen a range of techniques that aim to characterise adversarial subspaces over the learned representations. In this paper, we adapt two such approaches to NLP, one based on nearest neighbors and influence functions and one on Mahalanobis distances. The former in particular produces a state-of-the-art detector when compared against several strong baselines; moreover, the novel use of influence functions provides insight into how the nature of adversarial example subspaces in NLP relates to that in image processing, and also how it differs depending on the kind of NLP task.
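
As a rough illustration of the Mahalanobis-distance idea mentioned in the abstract, the sketch below scores an input by its squared Mahalanobis distance to the nearest class mean of clean learned representations, so that unusually distant inputs can be flagged. It is a minimal sketch under assumptions not stated in the abstract: a single covariance shared across classes, a small ridge term for invertibility, and hypothetical names (`fit_gaussians`, `solve`, `mahalanobis_score`). It is not the authors' implementation.

```python
def fit_gaussians(features, labels):
    """Per-class means plus one shared covariance, estimated from clean representations."""
    dim = len(features[0])
    means = {}
    for c in set(labels):
        rows = [f for f, y in zip(features, labels) if y == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for f, y in zip(features, labels):
        d = [f[j] - means[y][j] for j in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += d[i] * d[j] / len(features)
    for i in range(dim):
        cov[i][i] += 1e-6  # ridge term so the covariance stays invertible (assumption)
    return means, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_score(x, means, cov):
    """Squared Mahalanobis distance from x to the closest class mean."""
    scores = []
    for mean in means.values():
        d = [xi - mi for xi, mi in zip(x, mean)]
        # d^T cov^{-1} d, computed by solving cov @ y = d instead of inverting cov
        scores.append(sum(di * yi for di, yi in zip(d, solve(cov, d))))
    return min(scores)
```

In use, a detector of this kind would compute the score on representations of held-out clean and adversarial inputs and choose a cut-off between them; how that cut-off is chosen is left open here.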

CL · Jun 28, 2024
IDT: Dual-Task Adversarial Attacks for Privacy Protection

Pedro Faustini, Shakila Mahjabin Tonni, Annabelle McIver et al.

Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or -- for instance, in cases where users might not trust a model, the sort of scenario of interest here -- changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g. the gender of the author, or their location, from the writing style) whilst keeping the text useful for its original intention (e.g. the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create texts that differ extensively from the originals or face problems such as mode collapse. This paper explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t. one task (privacy) whilst keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate on several NLP datasets suited to different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t. the privacy task.

LG · Feb 17, 2020
Data and Model Dependencies of Membership Inference Attack

Shakila Mahjabin Tonni, Dinusha Vatsalan, Farhad Farokhi et al.

Machine learning (ML) models have been shown to be vulnerable to Membership Inference Attacks (MIA), which infer the membership of a given data point in the target dataset by observing the prediction output of the ML model. While the key factors for the success of MIA have not yet been fully understood, existing defense mechanisms such as using L2 regularization (Shokri et al., 2017) and dropout layers (Salem et al., 2018) take only the model's overfitting property into consideration. In this paper, we provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA. Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use. In particular, we show that the size of the shadow dataset, the class and feature balance and the entropy of the target dataset, and the configurations and fairness of the training model are the most influential factors. Based on those experimental findings, we conclude that along with model overfitting, multiple properties jointly contribute to MIA success instead of any single property. Building on our experimental findings, we propose using those data and model properties as regularizers to protect ML models against MIA. Our results show that the proposed defense mechanisms can reduce the MIA accuracy by up to 25% without sacrificing the ML model prediction utility.
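
To make "observing the prediction output" concrete, the sketch below shows one of the simplest membership inference attacks: a threshold on the model's top predicted probability, calibrated on shadow data whose membership is known. This is an illustrative assumption, not the attack evaluated in the paper, which may use a trained attack classifier instead; the function names `fit_threshold` and `infer_membership` are hypothetical.

```python
def fit_threshold(shadow_probs, shadow_is_member):
    """Pick the confidence threshold that best separates shadow members from non-members."""
    confidences = [max(p) for p in shadow_probs]
    best_threshold, best_accuracy = 0.0, -1.0
    for t in sorted(set(confidences)):
        guesses = [c >= t for c in confidences]
        correct = sum(g == m for g, m in zip(guesses, shadow_is_member))
        accuracy = correct / len(guesses)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold


def infer_membership(target_probs, threshold):
    """Guess 'member' whenever the target model's top probability reaches the threshold."""
    return [max(p) >= threshold for p in target_probs]


# Toy shadow-model outputs: training members tend to receive more confident predictions.
shadow_probs = [[0.99, 0.01], [0.95, 0.05], [0.60, 0.40], [0.55, 0.45]]
shadow_is_member = [True, True, False, False]
threshold = fit_threshold(shadow_probs, shadow_is_member)
guesses = infer_membership([[0.97, 0.03], [0.70, 0.30]], threshold)
```

The attack relies on the gap between confidence on training and unseen points, which is why overfitting matters; the abstract's point is that dataset and model properties beyond overfitting also affect how well such attacks work.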