CR · LG · Jul 21, 2024

A General Framework for Data-Use Auditing of ML Models

arXiv:2407.15100v3 · 21 citations · h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of protecting data owners' rights in ML by providing a practical auditing tool, though it builds incrementally on existing methods.

The paper tackles the problem of auditing machine-learning models for unauthorized use of data in training. It proposes a general framework that combines black-box membership inference with a sequential hypothesis test to detect data use with a quantifiable false-detection rate, and demonstrates its effectiveness on image classifiers and foundation models.

Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.
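To make the idea of a sequential test with a tunable false-detection rate concrete, here is a minimal sketch. It is not the test designed in the paper; it swaps in a textbook one-sided sequential probability ratio test. The setup is an assumption made for illustration: each query to the audited model yields one boolean outcome (for example, whether a black-box membership inference score ranks one of the owner's items above a comparable held-back item), and when the data was not used these outcomes behave like independent fair coin flips. The function name `sequential_audit` and the parameters `alpha` and `p1` are hypothetical.

```python
import math
from typing import Iterable, Optional


def sequential_audit(outcomes: Iterable[bool],
                     alpha: float = 0.05,
                     p1: float = 0.7) -> Optional[int]:
    """One-sided sequential probability ratio test (illustrative only).

    H0 (data not used): each outcome is True with probability 0.5.
    H1 (data used):     each outcome is True with probability p1 > 0.5.

    Returns the number of outcomes consumed when data use is declared,
    or None if the outcomes run out without a detection.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.5 < p1 < 1.0:
        raise ValueError("p1 must lie in (0.5, 1)")

    # Declaring detection only once the likelihood ratio reaches 1/alpha
    # keeps the false-detection probability under H0 at or below alpha
    # (Ville's inequality), regardless of when the test stops.
    log_threshold = math.log(1.0 / alpha)
    log_win = math.log(p1 / 0.5)            # evidence added by a True outcome
    log_loss = math.log((1.0 - p1) / 0.5)   # evidence removed by a False outcome

    log_lr = 0.0
    for n, win in enumerate(outcomes, start=1):
        log_lr += log_win if win else log_loss
        if log_lr >= log_threshold:
            return n  # enough evidence: declare data use after n queries
    return None       # never crossed the threshold: no detection


if __name__ == "__main__":
    # Hypothetical outcome streams, not results from the paper.
    print(sequential_audit([True] * 20))         # consistently favourable outcomes
    print(sequential_audit([True, False] * 50))  # outcomes that look like chance
```

Lowering `alpha` raises the evidence threshold, so a detection needs more queries but is less likely to be a false alarm; this is the sense in which the false-detection rate is tunable in this sketch.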

Code Implementations: 1 repo
