CL · AI · CR · Sep 20, 2024

Measuring Copyright Risks of Large Language Model via Partial Information Probing

arXiv:2409.13831v1 · 12 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses copyright risks for users and developers of LLMs, but its contribution is incremental: it builds on existing methods for testing output-based infringement.

The paper assesses copyright infringement risk in Large Language Models by testing whether they generate content that overlaps with copyrighted materials when given partial inputs. It finds that LLMs can produce highly overlapping content.

Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringement risk. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Following this direction, we assess LLMs' capacity to generate infringing content by providing them with partial information from copyrighted materials, and we use iterative prompting to elicit further infringing content. Specifically, we input a portion of a copyrighted text into an LLM, prompt it to complete the text, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content that overlaps heavily with copyrighted materials based on these partial inputs.
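
The probing procedure described in the abstract (give the model part of a text, ask for a completion, measure overlap with the held-out remainder) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify the split ratio, the prompts, or the overlap metric, so the word-level split, the longest-shared-run and n-gram measures, and the names `split_prefix`, `longest_common_run`, `ngram_overlap`, `iterative_probe`, and the `complete` callable are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Set, Tuple


def split_prefix(text: str, fraction: float = 0.5) -> Tuple[str, str]:
    """Split a text (by words) into a prompt prefix and the held-out continuation."""
    words = text.split()
    cut = max(1, int(len(words) * fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def longest_common_run(generated: str, reference: str) -> int:
    """Length, in words, of the longest contiguous word sequence shared by both texts."""
    a, b = generated.split(), reference.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def ngram_overlap(generated: str, reference: str, n: int = 3) -> float:
    """Fraction of the reference's distinct word n-grams that also occur in the generation."""
    def ngrams(s: str) -> Set[Tuple[str, ...]]:
        w = s.split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}

    ref = ngrams(reference)
    if not ref:
        return 0.0
    return len(ref & ngrams(generated)) / len(ref)


def iterative_probe(
    text: str,
    complete: Callable[[str], str],  # hypothetical: wraps whatever LLM is being probed
    fraction: float = 0.5,
    rounds: int = 2,
) -> Dict[str, float]:
    """Prompt with the prefix, then feed each output back in to elicit more text."""
    prefix, reference = split_prefix(text, fraction)
    generated = ""
    for _ in range(rounds):
        prompt = (prefix + " " + generated).strip()
        generated = (generated + " " + complete(prompt)).strip()
    return {
        "longest_run": longest_common_run(generated, reference),
        "trigram_overlap": ngram_overlap(generated, reference, n=3),
    }
```

Here `complete` stands in for a call to the model under test; any function that maps a prompt string to a continuation string would work. Both scores are simple lexical proxies for overlap, and a real study would likely need tokenization, normalization, and thresholds chosen with the legal question in mind.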

Code Implementations: 1 repo