Feb 21, 2025

Generative AI Training and Copyright Law

arXiv:2502.15858v1, 7 citations, h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper addresses the legal risks that AI developers and researchers face when using copyrighted data, and it highlights a critical gap in current copyright frameworks.

The paper argues that training generative AI models on web-scraped, copyright-protected data may not be covered by "fair use" in the USA or by the "Text and Data Mining" (TDM) exception in Europe, because such training fundamentally differs from traditional data mining. It also discusses how training data memorization exacerbates copyright issues.

Training generative AI models requires extensive amounts of data. A common practice is to collect such data through web scraping. Yet much of what has been, and is still being, collected is copyright-protected, and its use may constitute copyright infringement. In the USA, AI developers rely on "fair use"; in Europe, the prevailing view is that the exception for "Text and Data Mining" (TDM) applies. In a recent interdisciplinary tandem study, we argued in detail that this is not the case, because generative AI training fundamentally differs from TDM. In this article, we share our main findings and their implications for both public and corporate research on generative models. We further discuss how the phenomenon of training data memorization leads to copyright issues independently of the "fair use" and TDM exceptions. Finally, we outline how ISMIR could contribute to the ongoing discussion about fair practices for generative AI that satisfy all stakeholders.

