CL, LG | Jul 14, 2021

Deduplicating Training Data Makes Language Models Better

arXiv:2107.06499v2 | 872 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses data quality issues for language model developers, enabling more efficient training and reduced memorization, though it is an incremental improvement on existing methods.

The paper tackles near-duplicate examples and long repetitive substrings in language modeling datasets, which lead models trained on those datasets to copy over 1% of their unprompted output verbatim from the training data. After deduplication, models emit memorized text ten times less frequently and reach the same or better accuracy in fewer training steps. Deduplication also reduces train-test overlap, which affects over 4% of the validation set of standard datasets.

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
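To make the ideas of exact duplicates, near-duplicates, and train-test overlap concrete, here is a minimal Python sketch. It is a toy illustration only, not the released tools from the repository above: it assumes near-duplicates can be approximated by Jaccard similarity over word trigrams and uses a brute-force pairwise comparison that would not scale to datasets like C4. All function names and the thresholds are hypothetical choices made for this example.

```python
from collections import Counter


def exact_duplicates(examples):
    """Return whitespace-normalized texts that occur more than once, with counts."""
    counts = Counter(" ".join(text.split()) for text in examples)
    return {text: n for text, n in counts.items() if n > 1}


def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a text (assumed similarity feature)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity between the word n-gram sets of two texts."""
    sa, sb = word_ngrams(a, n), word_ngrams(b, n)
    if not sa and not sb:
        return 1.0  # two texts too short to have n-grams are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicate_pairs(examples, threshold=0.8, n=3):
    """Brute-force O(N^2) search for index pairs with similarity >= threshold."""
    pairs = []
    for i in range(len(examples)):
        for j in range(i + 1, len(examples)):
            if jaccard(examples[i], examples[j], n) >= threshold:
                pairs.append((i, j))
    return pairs


def overlap_fraction(train, valid, threshold=0.8, n=3):
    """Fraction of validation examples that have a near-duplicate in train."""
    if not valid:
        return 0.0
    hits = sum(
        any(jaccard(v, t, n) >= threshold for t in train) for v in valid
    )
    return hits / len(valid)


train = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "something else entirely different here",
]
valid = [
    "the quick brown fox jumps over the lazy cat",
    "unrelated validation text goes here",
]

print(exact_duplicates(train))
print(near_duplicate_pairs(train, threshold=0.7))
print(overlap_fraction(train, valid, threshold=0.7))
```

The same similarity function serves both purposes here: flagging repeated examples within the training set and estimating how much of a validation set also appears in training. A practical pipeline would replace the pairwise loop with a scalable index.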

Code Implementations: 2 repos
