CR · LG · Jul 21, 2024

A General Framework for Data-Use Auditing of ML Models

Zonghao Huang, Neil Zhenqiang Gong, Michael K. Reiter

arXiv:2407.15100v313.323 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of protecting data owners' rights in ML by providing a practical auditing tool, though it builds incrementally on existing membership-inference methods.

The paper tackles the problem of auditing machine-learning models for unauthorized use of data in training, proposing a general framework that combines black-box membership inference with a sequential hypothesis test to detect data use with a quantifiable false-detection rate, and demonstrates effectiveness on image classifiers and foundation models.

Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.
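To make the idea of a sequential test with a tunable false-detection rate concrete, the following is a minimal sketch using Wald's classical sequential probability ratio test (SPRT). It is not the authors' test, which the abstract describes as their own design. The sketch assumes, hypothetically, that each black-box membership-inference query yields a binary outcome that equals 1 with probability `p0` when the data was not used in training and with a higher probability `p1` when it was. The function name `sprt_audit` and all parameter values are illustrative.

```python
import math


def sprt_audit(outcomes, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald SPRT over binary membership-inference outcomes (illustrative).

    H0: outcomes are Bernoulli(p0), i.e. the data was not used.
    H1: outcomes are Bernoulli(p1), i.e. the data was used.
    alpha is the target false-detection rate and beta the target miss rate;
    Wald's thresholds meet these targets only approximately.
    Returns (decision, number_of_outcomes_consumed).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: reject H0
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(outcomes, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "data use detected", n
        if llr <= lower:
            return "no evidence of use", n
    # Ran out of outcomes before either threshold was crossed.
    return "undecided", n
```

Lowering `alpha` raises the upper threshold, so the test needs more consistent evidence before it reports data use. That is the sense in which the false-detection rate of a sequential test is tunable, and the test stops as soon as the evidence is sufficient rather than after a fixed number of queries.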
