Masoumeh Shafieinejad

CR
Semantic Scholar Profile
h-index: 17
Papers: 9
Citations: 36
Novelty: 55%
AI Score: 54

9 Papers

86.6 LG Mar 19 Code
MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. The MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. As a key outcome, MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

CR Feb 10
CAPID: Context-Aware PII Detection for Question-Answering Systems

Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad et al.

Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but due to their closed-source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.

57.2 LG May 12
FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models

Abtin Mahyar, Masoumeh Shafieinejad, Yuhan Liu et al.

Diffusion models are the leading approach for tabular data synthesis and are increasingly used to share sensitive records. Whether they actually protect privacy has become a pressing question. Membership inference attacks are the standard tool for answering it, yet existing attacks assume a single-table setting and ignore the multi-relational structure of real sensitive data. A core challenge in assessing privacy risks from membership inference attacks in multi-table settings is how to leverage auxiliary information from relations associated with the target table, such as its parent tables. In particular, we study a practical setting in which such auxiliary information is available only when training the attack model. At inference time, the attacker observes only the attribute values of the target record from the target table. We propose FERMI (FEature-mapping for Relational Membership Inference), which bridges this gap by enriching single-table features with relational membership signal. Across three tabular diffusion architectures and three real-world relational datasets, FERMI consistently improves attack performance over single-table baselines, with TPR at $0.1$ FPR rising by up to 53% in the white-box setting and 22% in the black-box setting.
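
The reported metric, true positive rate at a fixed false positive rate, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tpr_at_fpr`, the tie-handling rule, and the toy scores and labels are assumptions made for illustration only.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return the highest TPR reachable by a score threshold with FPR <= max_fpr.

    scores: attack scores, where higher means "more likely a training member".
    labels: 1 for true members, 0 for non-members.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one member and one non-member")

    # Sweep thresholds from the highest score downward. Records that share a
    # score are admitted together, so a tie can never be split by a threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops, so stop here
        best_tpr = tp / n_pos
        i = j
    return best_tpr


# Toy data (made up): four members and four non-members.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(tpr_at_fpr(scores, labels, max_fpr=0.25))  # 0.75
```

Reporting TPR at a low FPR, rather than average accuracy, emphasizes how many members an attacker can identify while making few false accusations.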

46.9 LG May 7
On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics

Masoumeh Shafieinejad, D. B. Emerson, Behnoosh Zamanlooy et al.

Tabular data plays an important role in many fields and industries, including those with elevated privacy considerations and risks. As such, there is rising interest in generating high-quality synthetic proxies for real tabular data as a means of reducing privacy risk and proprietary data exposure. With tabular diffusion models (TDMs) demonstrating leading performance in synthesizing such data, understanding and measuring the privacy risks associated with these models is imperative. Leveraging state-of-the-art membership inference attacks for TDMs in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. Moreover, the results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. Finally, the pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest-record, are revealed.
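
For readers unfamiliar with the distance-to-closest-record heuristic mentioned above, the following minimal sketch shows what it computes. It assumes purely numeric, already-scaled features and Euclidean distance. The function name and toy rows are hypothetical, and the abstract does not specify which variant of the metric the paper examines.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the distance to its nearest real row."""
    if not real:
        raise ValueError("real dataset must not be empty")
    # Assumption: rows are equal-length numeric tuples on comparable scales.
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Toy data (made up): the first synthetic row copies a real record exactly.
real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0)]
print(distance_to_closest_record(synthetic_rows, real_rows))  # [0.0, 3.0]
```

A small distance flags a synthetic row that sits close to a real one. The abstract's point is that this kind of heuristic has pitfalls as a privacy measure, so the sketch should be read as a description of the metric, not an endorsement of it.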

CR Sep 27, 2025
MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction

Sepideh Abedini, Shubhankar Mohapatra, D. B. Emerson et al.

Large language models (LLMs) have shown promising performance on tasks that require reasoning, such as text-to-SQL, code generation, and debugging. However, regulatory frameworks with strict privacy requirements constrain their integration into sensitive systems. State-of-the-art LLMs are also proprietary, costly, and resource-intensive, making local deployment impractical. Consequently, utilizing such LLMs often requires sharing data with third-party providers, raising privacy concerns and risking noncompliance with regulations. Although fine-tuned small language models (SLMs) can outperform LLMs on certain tasks and be deployed locally to mitigate privacy concerns, they underperform on more complex tasks such as text-to-SQL translation. In this work, we introduce MaskSQL, a text-to-SQL framework that utilizes abstraction as a privacy protection mechanism to mask sensitive information in LLM prompts. Unlike redaction, which removes content entirely, or generalization, which broadens tokens, abstraction retains essential information while discarding unnecessary details, striking an effective privacy-utility balance for the text-to-SQL task. Moreover, by providing mechanisms to control the privacy-utility tradeoff, MaskSQL facilitates adoption across a broader range of use cases. Our experimental results show that MaskSQL outperforms leading SLM-based text-to-SQL models and achieves performance approaching state-of-the-art LLM-based models, while preserving privacy.
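
One way to picture masking sensitive information in a prompt is sketched below. The abstract does not describe MaskSQL's actual mechanism, so this is only a toy reading of "abstraction": each sensitive literal is replaced by a typed placeholder that keeps its category but hides its value, and the placeholders are mapped back locally after the LLM returns SQL. The function names, placeholder format, and example question are all hypothetical.

```python
import re


def abstract_prompt(question, sensitive_values):
    """Replace each sensitive literal with a typed placeholder such as PERSON_1.

    sensitive_values maps a literal string to its category label (uppercase).
    Returns the masked question and the placeholder-to-literal mapping.
    """
    mapping = {}
    counters = {}
    masked = question
    for value, kind in sensitive_values.items():
        counters[kind] = counters.get(kind, 0) + 1
        placeholder = f"{kind}_{counters[kind]}"
        mapping[placeholder] = value
        masked = masked.replace(value, placeholder)
    return masked, mapping


def restore_sql(sql, mapping):
    """Substitute the original literals back into SQL produced from a masked prompt."""
    # Match whole placeholders only, so PERSON_1 never clobbers PERSON_10.
    return re.sub(
        r"\b[A-Z]+_\d+\b",
        lambda m: mapping.get(m.group(0), m.group(0)),
        sql,
    )


# Toy example (made up). The masked question is what would leave the machine.
masked, mapping = abstract_prompt(
    "List orders placed by Alice Smith in Toronto",
    {"Alice Smith": "PERSON", "Toronto": "CITY"},
)
print(masked)  # List orders placed by PERSON_1 in CITY_1

# Stand-in for the SQL a remote LLM might return for the masked question.
llm_sql = "SELECT * FROM orders WHERE customer = 'PERSON_1' AND city = 'CITY_1'"
print(restore_sql(llm_sql, mapping))
```

The placeholder keeps the type of each value, which is the information a text-to-SQL model plausibly needs to choose the right column, while the literal itself never appears in the outgoing prompt.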

CR May 27, 2021
On Privacy and Confidentiality of Communications in Organizational Graphs

Masoumeh Shafieinejad, Huseyin Inan, Marcello Hasegawa et al.

Machine-learned models trained on organizational communication data, such as emails in an enterprise, carry unique risks of breaching confidentiality, even if the model is intended only for internal use. This work shows how confidentiality is distinct from privacy in an enterprise context, and aims to formulate an approach to preserving confidentiality while leveraging principles from differential privacy. The goal is to perform machine learning tasks, such as learning a language model or performing topic analysis, using interpersonal communications in the organization, while not learning about confidential information shared in the organization. Works that apply differential privacy techniques to natural language processing tasks usually assume independently distributed data and overlook potential correlation among the records. Ignoring this correlation results in a fictional promise of privacy. Naively extending differential privacy techniques to focus on group privacy instead of record-level privacy is a straightforward approach to mitigating this issue. This approach, although it provides a more realistic privacy guarantee, is overcautious and severely impacts model utility. We show the gap between these two extreme measures of privacy on two language tasks, and introduce a middle-ground solution. We propose a model that captures the correlation in the social network graph, and incorporates this correlation in the privacy calculations through Pufferfish privacy principles.
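
The cost of the naive group-privacy extension can be seen from the standard group-privacy property of differential privacy, stated below as a reminder. This is textbook material, not the paper's Pufferfish-based formulation, and the symbols $M$, $D$, $D'$, $k$, and $\Delta$ are introduced here for illustration.

```latex
% Standard group-privacy property of epsilon-differential privacy.
% M is an epsilon-DP randomized mechanism; D and D' differ in at most k records.
\[
  \Pr[M(D) \in S] \;\le\; e^{k\varepsilon}\,\Pr[M(D') \in S]
  \quad \text{for all measurable sets } S .
\]
% Consequence for the Laplace mechanism with sensitivity \Delta:
% guaranteeing epsilon for a whole group of k correlated records requires
% running the mechanism at epsilon / k, so the noise scale grows from
% \Delta / \varepsilon to
\[
  b \;=\; \frac{k\,\Delta}{\varepsilon} .
\]
```

With a group of, say, $k = 10$ correlated records, the noise scale is ten times that of record-level protection, which is consistent with the utility loss the abstract attributes to the group-privacy approach.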

CR Mar 10, 2021
Equi-Joins Over Encrypted Data for Series of Queries

Masoumeh Shafieinejad, Suraj Gupta, Jin Yang Liu et al.

Encryption provides a method to protect data outsourced to a DBMS provider, e.g., in the cloud. However, performing database operations over encrypted data requires specialized encryption schemes that carefully balance security and performance. In this paper, we present a new encryption scheme that can efficiently perform equi-joins over encrypted data with better security than the state of the art. In particular, our encryption scheme reduces the leakage to equality of rows that match a selection criterion and only reveals the transitive closure of the sum of the leakages of each query in a series of queries. Our encryption scheme is provably secure. We implemented our encryption scheme and evaluated it over a dataset from the TPC-H benchmark.
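
To make the phrase "transitive closure of the leakages" concrete, the sketch below combines hypothetical per-query equality leakage with a union-find structure. It models only what an observer could infer from revealed equalities, not the encryption scheme itself. The row identifiers, function names, and the representation of leakage as pairs of equal rows are assumptions.

```python
class UnionFind:
    """Disjoint-set structure used to merge rows known to have equal join values."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent to keep trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def leaked_equivalence_classes(per_query_leakage):
    """Return the groups of rows an observer can link after a series of queries.

    per_query_leakage: one list per query, each holding (row, row) pairs whose
    join values were revealed to be equal by that query.
    """
    uf = UnionFind()
    for pairs in per_query_leakage:
        for a, b in pairs:
            uf.union(a, b)
    classes = {}
    for row in list(uf.parent):
        classes.setdefault(uf.find(row), set()).add(row)
    return sorted(sorted(group) for group in classes.values())


# Toy series of two queries (made up). Query 1 links a1 with b1; query 2 links
# b1 with c1, so a1 and c1 become linkable even though no query compared them.
queries = [
    [("a1", "b1")],
    [("b1", "c1"), ("a2", "b2")],
]
print(leaked_equivalence_classes(queries))
# [['a1', 'b1', 'c1'], ['a2', 'b2']]
```

The example shows why leakage across a series of queries has to be analyzed jointly: equalities revealed by separate queries compose, and the abstract's claim is that the scheme reveals no more than this composed closure.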

DB Mar 9, 2021
PCOR: Private Contextual Outlier Release via Differentially Private Search

Masoumeh Shafieinejad, Florian Kerschbaum, Ihab F. Ilyas

Outlier detection plays a significant role in various real-world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, which are not outliers in the whole dataset but are outliers in some specific neighborhoods. Contextual outliers are particularly important in data exploration and in targeted anomaly explanation and diagnosis. In these scenarios, the data owner computes the following information: i) the attributes that contribute to the abnormality of an outlier (metric), ii) a contextual description of the outlier's neighborhoods (context), and iii) the utility score of the context, e.g., its strength in showing the outlier's significance, or in relation to a particular explanation for the outlier. However, revealing the outlier's context leaks information about the other individuals in the population as well, violating their privacy. We address the issue of population privacy violations in this paper, and propose a solution for the two main challenges. In this setting, the data owner is required to release a valid context for the queried record, i.e., a context in which the record is an outlier. Hence, the first major challenge is that the privacy technique must preserve the validity of the context for each record. We propose techniques that protect the privacy of individuals through a relaxed notion of differential privacy to satisfy this requirement. The second major challenge is applying the proposed techniques efficiently, as they impose intensive computation on the base algorithm. To overcome this challenge, we propose a graph structure onto which the contexts are mapped, and introduce differentially private graph search algorithms as efficient solutions to the computation problem caused by differential privacy techniques.

LG Jun 18, 2019
On the Robustness of the Backdoor-based Watermarking in Deep Neural Networks

Masoumeh Shafieinejad, Jiaqi Wang, Nils Lukas et al.

Obtaining state-of-the-art performance from deep learning models imposes a high cost on model generators, due to the tedious data preparation and the substantial processing requirements. To protect models from unauthorized redistribution, watermarking approaches have been introduced in the past couple of years. We investigate the robustness and reliability of state-of-the-art deep neural network watermarking schemes. We focus on backdoor-based watermarking and propose two attacks, one black-box and one white-box, that remove the watermark. Our black-box attack steals the model and removes the watermark with minimal requirements; it relies only on public unlabeled data and black-box access to the classification label. It does not need classification confidences or access to the model's sensitive information such as the training data set, the trigger set, or the model parameters. The white-box attack proposes an efficient watermark removal when the parameters of the marked model are available; it does not require access to the labeled data or the trigger set and improves the runtime of the black-box attack by up to seventeen times. We also prove the security inadequacy of backdoor-based watermarking in keeping the watermark undetectable by proposing an attack that detects whether a model contains a watermark. Our attacks show that a recipient of a marked model can remove a backdoor-based watermark with significantly less effort than training a new model, and that other techniques are needed to protect against redistribution by a motivated attacker.