LG · ML · Feb 21, 2023

On Provable Copyright Protection for Generative Models

arXiv:2302.10870v2 · 123 citations · h-index: 96
Originality: Highly original
AI Analysis

The paper addresses copyright concerns for users of generative models, though it is incremental in that it builds on existing model learning algorithms.

The paper tackles the problem of generative models potentially outputting copyrighted content from their training data. It introduces a formal definition of near access-freeness (NAF) and proves bounds on the probability of such outputs. Its experiments show minimal degradation in output quality while ensuring strong protections.

There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data $C$ that was in their training set. We give a formal definition of $\textit{near access-freeness (NAF)}$ and prove bounds on the probability that a model satisfying this definition outputs a sample similar to $C$, even if $C$ is included in its training set. Roughly speaking, a generative model $p$ is $k$-$\textit{NAF}$ if for every potentially copyrighted data $C$, the output of $p$ diverges by at most $k$ bits from the output of a model $q$ that $\textit{did not access}$ $C$ $\textit{at all}$. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black-box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content.
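
To make the "$k$ bits" condition concrete, here is a minimal sketch on toy discrete distributions. The abstract does not say which divergence is used, so the sketch assumes the max-divergence, $\max_y \log_2 (p(y)/q(y))$. Under that assumption, a divergence of at most $k$ implies $p(E) \le 2^k \, q(E)$ for every event $E$, which is the kind of bound the abstract describes. The function names and the distributions `p` and `q` are hypothetical and are not taken from the paper.

```python
import math
from itertools import chain, combinations


def max_kl_bits(p, q):
    """Max-divergence in bits: the largest log2(p(y) / q(y)) over outcomes with p(y) > 0.

    Assumption: this is the divergence behind "diverges by at most k bits".
    """
    worst = -math.inf
    for y, py in p.items():
        if py == 0:
            continue
        qy = q.get(y, 0.0)
        if qy == 0:
            return math.inf  # p puts mass where q has none: no finite k works
        worst = max(worst, math.log2(py / qy))
    return worst


def event_prob(dist, event):
    """Probability that a sample from dist falls in the given collection of outcomes."""
    return sum(prob for y, prob in dist.items() if y in event)


def all_events(outcomes):
    """Every subset of the outcome space, as tuples."""
    items = list(outcomes)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def bound_holds(p, q, k, tol=1e-12):
    """Check p(E) <= 2^k * q(E) for every event E over p's outcomes."""
    if math.isinf(k):
        return True  # the bound is vacuous when k is infinite
    return all(
        event_prob(p, e) <= (2 ** k) * event_prob(q, e) + tol
        for e in all_events(p)
    )


# Hypothetical toy models: p was trained with access to C, q without it.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

k = max_kl_bits(p, q)
print(f"k = {k} bits; bound holds for all events: {bound_holds(p, q, k)}")
```

In this toy case the worst outcome is `"a"`, which `p` produces twice as often as `q`, so `p` sits within one bit of `q`. If `q` almost never outputs something similar to $C$, then a small $k$ means `p` rarely does so either.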

