LG DB · Oct 6, 2023

From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying

Biao Wu, Qiang Huang, Anthony K. H. Tung

arXiv:2310.04145v22.0 · h-index: 12

Originality: Incremental advance

AI Analysis

This paper addresses data leakage detection for data owners in machine learning and offers a novel way to protect intellectual property, though its extension to regression tasks is only an incremental step.

The paper tackles the detection of unauthorized use of leaked tabular data to train classification models. It introduces LDSS (Local Distribution Shifting Synthesis), which injects synthetic data with local class-distribution shifts into the owner's dataset so that models trained on the leaked data can be identified through querying alone. The authors report reliable and robust results for seven types of classification models across five real-world datasets.

Safeguarding the Intellectual Property (IP) of data has become critically important as machine learning applications continue to proliferate, and their success heavily relies on the quality of training data. While various mechanisms exist to secure data during storage, transmission, and consumption, fewer studies have been developed to detect whether they are already leaked for model training without authorization. This issue is particularly challenging due to the absence of information and control over the training process conducted by potential attackers. In this paper, we concentrate on the domain of tabular data and introduce a novel methodology, Local Distribution Shifting Synthesis (LDSS), to detect leaked data that are used to train classification models. The core concept behind LDSS involves injecting a small volume of synthetic data (characterized by local shifts in class distribution) into the owner's dataset. This enables the effective identification of models trained on leaked data through model querying alone, as the synthetic data injection results in a pronounced disparity in the predictions of models trained on leaked and modified datasets. LDSS is model-oblivious and hence compatible with a diverse range of classification models. We have conducted extensive experiments on seven types of classification models across five real-world datasets. The comprehensive results affirm the reliability, robustness, fidelity, security, and efficiency of LDSS. Extending LDSS to regression tasks further highlights its versatility and efficacy compared with baseline methods.
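The injection-and-query idea in the abstract can be illustrated with a toy sketch. This is not the authors' LDSS implementation: the two-feature dataset, the 3-nearest-neighbour classifier, the injection region around (0.1, 0.1), and every function name below are assumptions chosen only to keep the example self-contained. It assumes the owner queries a suspect model with the injected points and counts how often the shifted label comes back.

```python
import random

def knn_predict(train, query, k=3):
    """Majority label among the k training rows nearest to `query`."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def make_table(rng, n=200):
    """Toy tabular data (hypothetical): label is 1 when x0 + x1 > 1."""
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, int(x[0] + x[1] > 1.0)))
    return rows

def make_synthetic(rng, center, shifted_label, n=10, radius=0.02):
    """Small cluster around `center` whose label contradicts the local class."""
    return [
        ((center[0] + rng.uniform(-radius, radius),
          center[1] + rng.uniform(-radius, radius)), shifted_label)
        for _ in range(n)
    ]

def hit_rate(train, queries, shifted_label):
    """Fraction of queries for which the model returns the shifted label."""
    preds = [knn_predict(train, q) for q in queries]
    return sum(p == shifted_label for p in preds) / len(queries)

rng = random.Random(0)
owner = make_table(rng)
# The area around (0.1, 0.1) is naturally class 0; the injected rows say class 1.
injected = make_synthetic(rng, center=(0.1, 0.1), shifted_label=1)
modified = owner + injected                 # dataset the owner stores (and that may leak)
independent = make_table(random.Random(1))  # data a non-infringing model might use

queries = [x for x, _ in injected]          # the owner queries with the injected points
leaked_rate = hit_rate(modified, queries, 1)    # model trained on the leaked copy
clean_rate = hit_rate(independent, queries, 1)  # model trained on unrelated data
print(f"leaked model: {leaked_rate:.2f}, independent model: {clean_rate:.2f}")
```

A large gap between the two rates is the "pronounced disparity" the abstract refers to. A real detector would also need a decision threshold and would have to account for the fidelity and security properties the paper evaluates, none of which this sketch covers.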
