CL · AI · LG · Dec 24, 2021

Counterfactual Memorization in Neural Language Models

arXiv:2112.12938v2 · 193 citations
Originality: Incremental advance
AI Analysis

The paper addresses the risk that NLP models memorize sensitive information and provides a method for tracing memorization back to its source. The advance is incremental, since it builds on existing memorization studies.

The paper tackles the problem of distinguishing "common" memorization from sensitive data leakage in neural language models. It introduces a notion of counterfactual memorization, which measures how a model's predictions change if a document is omitted during training, and applies it to identify memorized examples in standard text datasets and to estimate their influence on the validation set and on generated texts.

Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing memorized familiar phrases, public knowledge, templated texts, or other repeated data. We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We estimate the influence of each memorized training example on the validation set and on generated texts, showing how this can provide direct evidence of the source of memorization at test time.
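
To make the counterfactual idea concrete, here is a minimal sketch of one way such a quantity could be estimated. It assumes that several models are trained on random subsets of the training set, and that each model is scored on each document with some per-example metric. The function names, the choice of metric, and the toy numbers are all illustrative assumptions, not the authors' implementation. Memorization of a document is approximated as the mean score of models that saw it minus the mean score of models that did not. Influence on a validation example uses the same difference, measured on the validation example instead.

```python
from statistics import mean


def counterfactual_memorization(included, scores):
    """Estimate per-document counterfactual memorization.

    included[m][i] -- True if model m's training subset contained document i.
    scores[m][i]   -- score of model m on document i (hypothetical metric,
                      e.g. per-token accuracy).
    Returns one value per document, or None when a document was seen by
    all models or by none, since no contrast is available in that case.
    """
    n_docs = len(scores[0])
    result = []
    for i in range(n_docs):
        seen = [s[i] for inc, s in zip(included, scores) if inc[i]]
        unseen = [s[i] for inc, s in zip(included, scores) if not inc[i]]
        if seen and unseen:
            result.append(mean(seen) - mean(unseen))
        else:
            result.append(None)
    return result


def counterfactual_influence(included, val_scores, train_idx, val_idx):
    """Estimate how much training document `train_idx` changes the score on
    validation example `val_idx`, using the same seen/unseen contrast."""
    seen = [v[val_idx] for inc, v in zip(included, val_scores) if inc[train_idx]]
    unseen = [v[val_idx] for inc, v in zip(included, val_scores) if not inc[train_idx]]
    return mean(seen) - mean(unseen)


# Toy data (made up for illustration): 4 models, 2 training documents.
included = [
    [True, True],
    [True, False],
    [False, True],
    [False, False],
]
# Document 0 is only predicted well by models that trained on it;
# document 1 is predicted almost equally well either way ("common" text).
scores = [
    [0.9, 0.8],
    [0.9, 0.7],
    [0.3, 0.8],
    [0.3, 0.7],
]
# One validation example, scored by each of the 4 models.
val_scores = [[0.5], [0.5], [0.2], [0.2]]

mem = counterfactual_memorization(included, scores)
infl = counterfactual_influence(included, val_scores, train_idx=0, val_idx=0)
print("memorization per document:", mem)
print("influence of document 0 on validation example 0:", infl)
```

In this toy setup document 0 has a large gap between the models that saw it and those that did not, while document 1 has a small one. That is the distinction the abstract draws between counterfactual memorization and "common" memorization of frequently repeated text.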

