CV | AI | LG | Feb 24, 2025

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Berkeley
arXiv:2502.17358v3 | 2 citations | h-index: 24 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This work addresses copyright infringement concerns for model developers and content creators, though it is an incremental improvement over existing detection approaches.

The authors address the problem of verifying whether copyrighted content was used to train a vision-language model (VLM) without access to its training data. Their DIS-CO method nearly doubles the average AUC of the best prior method on models with logits available.

How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
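The querying-and-scoring idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that each frame is sent to a VLM, that the free-form answer is checked for the film's title, and that per-film hit rates are compared between suspect and clean films using AUC. The names `ask_model`, `movie_score`, and `auc`, along with the substring matching rule, are hypothetical choices made for illustration.

```python
import re
from typing import Callable, Iterable, Sequence


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'The Matrix.' matches 'the matrix'."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def movie_score(
    frames: Iterable[object],
    title: str,
    ask_model: Callable[[object], str],  # hypothetical: frame -> free-form text
) -> float:
    """Fraction of frames whose free-form answer contains the film's title."""
    target = normalize(title)
    answers = [normalize(ask_model(frame)) for frame in frames]
    if not answers:
        return 0.0
    hits = sum(1 for answer in answers if target in answer)
    return hits / len(answers)


def auc(suspect_scores: Sequence[float], clean_scores: Sequence[float]) -> float:
    """Pairwise AUC: chance that a suspect film outscores a clean film.

    Ties count as half a win. 'Suspect' and 'clean' here stand in for films
    released before and after the model's training cutoff (an assumption
    about how the benchmark split would be used).
    """
    pairs = [(s, c) for s in suspect_scores for c in clean_scores]
    wins = sum(1.0 if s > c else 0.5 if s == c else 0.0 for s, c in pairs)
    return wins / len(pairs)


if __name__ == "__main__":
    # Stand-in for a real VLM call: a lookup table of canned answers.
    canned = {"frame_a": "The Matrix", "frame_b": "Not sure, maybe a sci-fi film"}
    print(movie_score(["frame_a", "frame_b"], "The Matrix", canned.get))
```

In practice `ask_model` would wrap a real VLM API call, and a stricter matching rule (for example exact title match or fuzzy matching) may be preferable to plain substring containment.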

Code Implementations: 1 repo
