CL · Apr 28, 2023

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Berkeley · Georgia Tech
arXiv:2305.00118v2 · 200 citations · h-index: 12
Originality: Incremental advance
AI Analysis

The paper exposes data contamination problems that matter to researchers and practitioners in AI evaluation, and it argues for open models whose training data is known.

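The study infers which books ChatGPT and GPT-4 have memorized using a name cloze membership inference query. It finds that the models have memorized a wide collection of copyrighted material, that the degree of memorization tracks how often passages from those books appear on the web, and that the models perform much better on memorized books than on non-memorized ones in downstream cultural analytics tasks.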

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
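The abstract names a name cloze membership inference query but does not spell out its mechanics. The following is a minimal sketch of the general idea, assuming a passage contains exactly one proper name, that name is replaced by a `[MASK]` token, and a model is asked to recover it. The helper names (`make_cloze`, `name_cloze_accuracy`), the mask token, and the `predict` callable are hypothetical and stand in for whatever prompt and model call the authors actually used.

```python
from typing import Callable, Iterable, Tuple

MASK = "[MASK]"  # assumed mask token, chosen for illustration


def make_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` in `passage` with the mask token."""
    if passage.count(name) != 1:
        raise ValueError("passage must contain the name exactly once")
    return passage.replace(name, MASK)


def name_cloze_accuracy(
    items: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of (passage, name) pairs for which `predict` recovers the name.

    `predict` is a hypothetical stand-in for a model call: it receives the
    masked passage and returns a single guessed name.
    """
    pairs = list(items)
    if not pairs:
        return 0.0
    hits = 0
    for passage, name in pairs:
        guess = predict(make_cloze(passage, name))
        # Compare case-insensitively, ignoring surrounding whitespace.
        if guess.strip().lower() == name.lower():
            hits += 1
    return hits / len(pairs)
```

Under these assumptions, a high per-book accuracy would suggest the model has seen the book's text, since a masked name in a short passage is hard to guess from context alone.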

Code Implementations: 1 repo
