Déjà Vu Memorization in Vision-Language Models
This work addresses data privacy and generalization concerns for VLM users by revealing memorization risks and showing how to mitigate them.
The paper tackles the problem of measuring memorization in Vision-Language Models (VLMs). It shows that models such as OpenCLIP retain information about individual objects in their training images beyond what can be inferred from correlations or the image caption, and that this memorization remains significant even when the model is trained on as many as 50M image-caption pairs. The authors also show that text randomization considerably reduces memorization while only moderately affecting downstream task performance.
Vision-Language Models (VLMs) have emerged as the state-of-the-art representation learning solution, with a myriad of downstream applications such as image classification, retrieval, and generation. A natural question is whether these models memorize their training data, which also has implications for generalization. We propose a new method for measuring memorization in VLMs, which we call déjà vu memorization. For VLMs trained on image-caption pairs, we show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption. We evaluate déjà vu memorization at both the sample and population levels, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs. Finally, we show that text randomization considerably mitigates memorization while only moderately impacting the model's downstream task performance.
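The abstract does not say how text randomization is implemented. One plausible reading is that a random subset of caption tokens is replaced during training, so the model sees a slightly different caption for the same image each epoch. The sketch below illustrates that idea only; the function name `randomize_caption`, the parameter `mask_prob`, the `<mask>` placeholder, and whitespace tokenization are all assumptions made for illustration, not details taken from the paper.

```python
import random


def randomize_caption(caption, mask_prob=0.3, mask_token="<mask>", rng=None):
    """Replace each whitespace-separated token with mask_token with probability mask_prob.

    Hypothetical helper: the masking rate, the placeholder token, and the
    tokenization are illustrative assumptions, not the authors' settings.
    """
    if rng is None:
        rng = random.Random()
    tokens = caption.split()
    # Decide independently for every token whether to mask it.
    masked = [mask_token if rng.random() < mask_prob else tok for tok in tokens]
    return " ".join(masked)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same captions
    caption = "a brown dog catching a frisbee in the park"
    for epoch in range(3):
        # Each epoch draws a fresh mask, so the caption varies across epochs.
        print(epoch, randomize_caption(caption, mask_prob=0.3, rng=rng))
```

In a training pipeline, a function like this would typically be applied to the caption inside the data loader, before tokenization, so that no single image is always paired with exactly the same text.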