LG · CL · Mar 6, 2023

Data Portraits: Recording Foundation Model Training Data

arXiv:2303.03919v2 · 39 citations · h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses training-data transparency for AI researchers and practitioners by providing a lightweight method for inspecting training data. The advance is incremental, since it builds on existing data sketching techniques.

The paper tackles the opacity of foundation model training data by proposing Data Portraits, artifacts that record training data to enable downstream inspection. The authors show that their tool adds overhead of only 3% of the dataset size and can answer questions about test set leakage and model plagiarism.

Foundation models are trained on increasingly immense and opaque datasets. Even while these models are now key in AI system building, it can be difficult to answer the straightforward question: has the model already encountered a given example during training? We therefore propose a widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
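
To make the idea of a sketch-based membership query concrete, here is a minimal sketch in Python. The abstract says only that the solution is "based on data sketching"; the choice of a Bloom filter, the 20-character chunk width, and the names `BloomFilter`, `build_portrait`, and `matching_chunks` are assumptions made for illustration, not the authors' implementation. Documents are recorded as non-overlapping character chunks, and a query is checked at every offset so that it does not need to align with the stored chunk boundaries.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bit array probed k times per item (assumed sketch type).

    Lookups can return false positives but never false negatives.
    """

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing formulas for a target false-positive rate.
        ln2 = math.log(2)
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / ln2**2))
        self.num_hashes = max(1, round(self.num_bits / capacity * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def chunks(text: str, width: int, stride: int) -> list[str]:
    """Return every `width`-character window of `text`, stepping by `stride`."""
    return [text[i:i + width] for i in range(0, len(text) - width + 1, stride)]


def build_portrait(documents, width: int = 20, capacity: int = 10_000) -> BloomFilter:
    """Record each document as non-overlapping chunks (stride == width)."""
    sketch = BloomFilter(capacity)
    for doc in documents:
        for chunk in chunks(doc, width, stride=width):
            sketch.add(chunk)
    return sketch


def matching_chunks(sketch: BloomFilter, query: str, width: int = 20) -> list[str]:
    """Return the windows of `query` that the sketch reports as recorded."""
    return [w for w in chunks(query, width, stride=1) if w in sketch]


corpus = ["The quick brown fox jumps over the lazy dog near the riverbank at dawn today."]
portrait = build_portrait(corpus)

leaked = corpus[0][5:70]  # a passage copied from the middle of the document
fresh = "Completely unrelated sentence about sketching structures."
print(len(matching_chunks(portrait, leaked)) > 0)
print(len(matching_chunks(portrait, fresh)) > 0)
```

Storing only non-overlapping chunks keeps the sketch small, at the cost of sliding the query window one character at a time. A hit means the chunk was probably recorded, because a Bloom filter can report false positives; a miss is definitive.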

