The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing
For researchers and publishers, the paper reveals a concrete, large-scale contamination of scholarly metadata by LLM-generated content, with measurable scope: 1,655 records, 991 of them registered in a single month.
The paper shows that LLMs generate correlated ensembles of fictional characters (e.g., Elena Vasquez and Marcus Chen) across many AI-generated documents. It documents 1,655 ghost-authored records on Zenodo that claim fabricated journals and carry backdated publication dates under real DOIs, showing that these name priors contaminate academic repositories.
These names do not exist. Elena Vasquez and Marcus Chen have never lived, yet they have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates. Server-side DataCite timestamps prove deliberate backdating, and 991 of the records were registered in a single month. Because these records carry real DOIs registered in DataCite, they are harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate, forming synthetic research groups with collaborators drawn from multiple model families. Publication dates on these records provide a reliable temporal proxy for model deployment windows.
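The claim that name pairs co-occur at rates that "far exceed chance" can be illustrated with a simple lift statistic: the observed rate at which two names share a document, divided by the rate expected if each name appeared independently. The sketch below is not the paper's method, which the abstract does not specify. The function name `pair_lift` is hypothetical, the four-document corpus is invented for illustration, and matching names by plain substring search is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations


def pair_lift(documents, names):
    """Map each co-occurring name pair to its lift.

    Lift = observed co-occurrence rate / rate expected under independence.
    Values well above 1 suggest the names travel together.
    """
    n_docs = len(documents)
    single = Counter()  # number of documents mentioning each name
    pair = Counter()    # number of documents mentioning both names of a pair
    for text in documents:
        # Assumption: a plain substring match is enough to detect a name.
        present = sorted(name for name in names if name in text)
        single.update(present)
        pair.update(combinations(present, 2))

    lifts = {}
    for (a, b), joint in pair.items():
        expected = (single[a] / n_docs) * (single[b] / n_docs)
        lifts[(a, b)] = (joint / n_docs) / expected
    return lifts


# Toy corpus, invented for illustration only (not data from the paper).
docs = [
    "Dr. Elena Vasquez and Marcus Chen studied the eruption.",
    "Elena Vasquez briefed Marcus Chen before launch.",
    "Aris Thorne met Lena Petrova at the observatory.",
    "A report that names no experts at all.",
]
names = ["Elena Vasquez", "Marcus Chen", "Aris Thorne", "Lena Petrova"]

lifts = pair_lift(docs, names)
for names_pair, lift in sorted(lifts.items()):
    print(names_pair, round(lift, 2))
```

On a real corpus, lift alone would overstate rare pairs, so a significance test or a minimum document count per name would be a sensible addition.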