IR AI HC Mar 29, 2025

Delving into: the quantification of Ai-generated content on the internet (synthetic data)

arXiv:2504.08755v110 citations h-index: 2

Originality: Synthesis-oriented

AI Analysis

This paper addresses the challenge of quantifying synthetic data online, which is crucial for understanding information integrity. The contribution is incremental, as it builds on existing keyword-based detection methods.

The paper tackles the problem of measuring AI-generated content on the internet by analyzing linguistic markers characteristic of ChatGPT. It finds that at least 30% of text on active web pages originates from AI sources, with estimates approaching 40%.

While it is increasingly evident that the internet is becoming saturated with content created by generative AI large language models, accurately measuring the scale of this phenomenon has proven challenging. By analyzing the frequency of specific keywords commonly used by ChatGPT, this paper demonstrates that such linguistic markers can effectively be used to estimate the presence of generative AI content online. The findings suggest that at least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, this is a sobering realization.
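
The abstract does not spell out the estimator, so the following is only a minimal sketch of how a keyword-frequency estimate could work, not the paper's actual method. It assumes a simple two-source mixture: if marker keywords occur at rate `human_rate` in human text and `ai_rate` in AI text, an observed rate in between implies a proportional AI share. The function names, the keyword set, and all rates below are hypothetical illustrations.

```python
import re


def keyword_rate(text, keywords):
    """Return the fraction of words in `text` that are marker keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in keywords)
    return hits / len(words)


def estimate_ai_share(observed_rate, human_rate, ai_rate):
    """Estimate the AI-written share under an assumed linear mixture model.

    Assumption (not from the paper):
        observed_rate = (1 - share) * human_rate + share * ai_rate
    Solving for share gives the expression below, clamped to [0, 1].
    """
    if ai_rate <= human_rate:
        raise ValueError("ai_rate must exceed human_rate for the markers to be informative")
    share = (observed_rate - human_rate) / (ai_rate - human_rate)
    return min(1.0, max(0.0, share))


# Hypothetical marker set and made-up baseline rates, for illustration only.
markers = {"delve"}
sample_rate = keyword_rate("We delve into this. Delve deeper!", markers)
share = estimate_ai_share(observed_rate=0.004, human_rate=0.001, ai_rate=0.011)
```

In practice the baseline rates would have to be measured on corpora of known origin, and the estimate is only as reliable as those baselines.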
